Age-standardized cancer mortality rates refer to the number of deaths attributed to cancer within a specific population over a given period, usually expressed as the number of deaths per 100,000 people adjusted for differences in age distribution. Monitoring cancer mortality rates allows public health authorities to track the burden of cancer, understand the prevalence of different cancer types and identify variations in different populations. Studying these metrics is essential for making accurate cross-country comparisons, identifying high-risk communities, informing public health policies, and supporting international efforts to address the global burden of cancer.
Datasets used for the analysis were separately gathered and consolidated from various sources including:
This study hypothesized that mortality rates by major cancer types contain inherent patterns and structures within the data, enabling the grouping of similar countries and the differentiation of dissimilar ones.
Subsequent analysis and modelling steps involving data understanding, data preparation, data exploration, model development, model validation and model presentation are individually detailed below, with all the results consolidated in a Summary at the end of the document.
The main objective of the study is to develop a clustering model with an optimal number of clusters that could recognize patterns and relationships among cancer mortality rates across countries, allowing for a deeper understanding of the inherent and underlying data structure when evaluated against supplementary information on lifestyle factors and geolocation.
Specific objectives are given as follows:
Obtain an optimal subset of observations and descriptors by conducting data quality assessment and feature selection, excluding cases or variables noted with irregularities and applying preprocessing operations most suitable for the downstream analysis
Develop multiple clustering models with optimized hyperparameters in terms of the number of clusters through internal resampling validation
Select the final clustering model among candidates based on its ability to quantify the compactness and separation of clusters
Interpret clusters based on inter-cluster patterns and intra-cluster dissimilarities
Conduct a post-hoc exploration of the results to provide general insights on the relationship and association among and between the formulated clusters
Due to the unsupervised learning nature of the analysis, there is no target variable defined for the study.
The clustering descriptors are the primary variables to be evaluated in formulating the clusters for segmenting the countries in the study.
Detailed descriptions for each individual clustering descriptor are provided as follows:
The target descriptors are the secondary variables against which the formulated clusters will be compared, providing additional context to the findings.
Detailed descriptions for each individual target descriptor are provided as follows:
The metadata variables providing geolocation information for the study are:
Preliminary data used in the study was evaluated and prepared for analysis and modelling using the following methods:
Data Quality Assessment involves profiling and assessing the data to understand its suitability for machine learning tasks. The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during collection, aggregation or annotation stage. Issues such as incorrect labels, synonymous categories in a categorical variable or heterogeneity in columns, among others, which might go undetected by standard pre-processing modules in these frameworks can lead to sub-optimal model performance, inaccurate analysis and unreliable decisions.
Data Preprocessing involves changing the raw feature vectors into a representation that is more suitable for the downstream modelling and estimation processes, including data cleaning, integration, reduction and transformation. Data cleaning aims to identify and correct errors in the dataset that may negatively impact a predictive model such as removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Data integration addresses potential issues with redundant and inconsistent data obtained from multiple sources through approaches such as detection of tuple duplication and data conflict. The purpose of data reduction is to have a condensed representation of the data set that is smaller in volume, while maintaining the integrity of the original data set. Data transformation converts the data into the most appropriate form for data modeling.
Data Exploration involves analyzing and investigating data sets to summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to discover patterns, spot anomalies, test a hypothesis, or check assumptions. This process is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a better understanding of data set variables and the relationships between them.
Yeo-Johnson Transformation applies a family of power transformations that can be used without restriction on the sign of the input, extending many of the good properties of the Box-Cox power family. Similar to the Box-Cox transformation, the method estimates the optimal value of lambda, but it has the ability to transform both positive and negative values by inflating low-variance data and deflating high-variance data to create a more uniform data set. While there are no restrictions on the applicable values, the interpretability of the transformed values is diminished compared to the other methods.
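As a minimal sketch of the transformation, the snippet below applies scikit-learn's `PowerTransformer` with `method='yeo-johnson'` to a small hypothetical sample containing zero and negative entries (which Box-Cox cannot handle); the data values are illustrative only, not from the study.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed sample with zero and negative entries,
# which Box-Cox cannot handle but Yeo-Johnson can
x = np.array([[-2.0], [0.0], [0.5], [1.0], [3.0], [10.0], [50.0]])

# The optimal lambda is estimated by maximum likelihood;
# standardize=True additionally centers and scales the output
transformer = PowerTransformer(method='yeo-johnson', standardize=True)
x_transformed = transformer.fit_transform(x)

print(transformer.lambdas_)
print(x_transformed.ravel())
```

With `standardize=True`, the transformed column has zero mean and unit variance, which is convenient for distance-based clustering downstream.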
Statistical measures assessed for the numeric descriptors in the study to determine the optimal subset of variables for the subsequent modelling process included the following:
Pearson’s Correlation Coefficient is a parametric measure of the linear correlation for a pair of features, calculated as the ratio between their covariance and the product of their standard deviations. The presence of high absolute correlation values indicates a univariate association among the numeric descriptors.
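A minimal sketch of this screening step is shown below on synthetic data; the 0.8 cutoff and the column names are illustrative assumptions, not the study's actual thresholds or variables.

```python
import numpy as np
import pandas as pd

# Synthetic descriptors: B is constructed to be strongly correlated with A,
# while C is independent noise
rng = np.random.default_rng(42)
a = rng.normal(size=100)
df = pd.DataFrame({'A': a,
                   'B': a * 2 + rng.normal(scale=0.1, size=100),
                   'C': rng.normal(size=100)})

# Pairwise Pearson correlation matrix
corr = df.corr(method='pearson')

# Flag descriptor pairs with absolute correlation above a chosen cutoff (0.8 here)
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.8]
print(high)
```

Flagged pairs would then be candidates for dropping one member before clustering, since highly collinear descriptors distort distance-based segmentation.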
Cluster Analysis is a form of unsupervised learning method aimed at identifying similar structural patterns in an unlabeled data set by segmenting the observations into clusters with shared characteristics as compared to those in other clusters.
This study implemented clustering algorithms which formulated partitioned segments from the data set through hierarchical methods (either agglomeratively, when smaller clusters are merged into larger clusters, or divisively, when larger clusters are divided into smaller clusters) and non-hierarchical methods (when each observation is placed in exactly one of the mutually exclusive clusters). Models applied in the analysis for clustering the high-dimensional data were the following:
K-Means Clustering groups similar data points together into clusters by minimizing the distance between each data point and the center of its assigned cluster. The algorithm iteratively partitions the data set into a fixed number of non-overlapping k subgroups or clusters wherein each data point belongs to the cluster with the nearest cluster center. The process begins by initializing a pre-defined k number of cluster centers. With every pass of the algorithm, each point is assigned to its nearest cluster center. The cluster centers are then updated by re-calculating each one as the average of the points assigned to it in that pass. The algorithm repeats until the change in the cluster centers from the last iteration falls below a minimum threshold.
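The iterative assign-and-update loop above can be sketched with scikit-learn's `KMeans`; the blob data here is a synthetic stand-in for the scaled mortality-rate descriptors, not the study's data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled mortality-rate descriptors
X, _ = make_blobs(n_samples=60, centers=3, n_features=2, random_state=0)

# n_init restarts the center initialization several times and keeps the best run
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)
```

Each row of `cluster_centers_` is the mean of the points assigned to that cluster in the final pass, exactly the update rule described above.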
Bisecting K-Means Clustering is a variant of the traditional K-Means algorithm which iteratively splits clusters into two parts until the desired number of clusters is reached. It is a hierarchical clustering approach that uses a divisive strategy to build a hierarchy of clusters. The algorithm starts with the entire dataset as the initial cluster. At each iteration, a cluster is selected for splitting (when multiple clusters are present, the one with the largest variance), and the standard K-Means algorithm is applied to split it into two sub-clusters. These steps are repeated until the desired number of clusters is reached. This results in a hierarchical structure of clusters, and the process can be stopped at any desired level of granularity.
Gaussian Mixture Clustering is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters, incorporating information about the covariance structure of the data as well as the centers of the latent Gaussians. The algorithm involves initializing the parameters of the Gaussian components using K-Means clustering to obtain initial estimates for the means, with the identity matrix as a starting point for the covariance matrices. The expectation-maximization process is then applied: the expectation step calculates the probability of each data point belonging to each Gaussian component using Bayes' theorem, and the maximization step updates the parameters of the Gaussian components as probability-weighted sums of the data points. Convergence is checked by evaluating whether the log-likelihood of the data has stabilized or reached a maximum, and both steps are iterated until this criterion is met. After convergence, each data point is assigned to the cluster with the highest probability.
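The expectation-maximization loop above can be sketched with scikit-learn's `GaussianMixture` on synthetic data; `init_params='kmeans'` mirrors the K-Means-based initialization described, and `predict_proba` exposes the soft E-step assignments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data in place of the actual descriptors
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# init_params='kmeans' seeds the component means with K-Means,
# matching the initialization described above
gmm = GaussianMixture(n_components=3, init_params='kmeans', random_state=0)
hard_labels = gmm.fit_predict(X)           # highest-probability component per point
soft_labels = gmm.predict_proba(X)         # per-component posterior probabilities (E-step)

print(gmm.converged_)
```

Each row of `soft_labels` sums to one, and the hard assignment is simply the argmax of that row.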
Agglomerative Clustering builds a hierarchy of clusters. In this algorithm, each data point starts as its own cluster, and the algorithm merges clusters iteratively until a stopping criterion is met. The algorithm starts with each data point as a singleton cluster, with the number of initial clusters equal to the number of data points. The pairwise distance matrix is calculated between all clusters using complete linkage, determined as the maximum distance between any two points in the two clusters. The two clusters with the minimum distance according to the linkage criterion are identified and merged in the next step. The distances between the new cluster and all other clusters are recalculated. All previous steps are repeated until the desired number of clusters is reached or until a stopping criterion is met.
Ward Hierarchical Clustering creates compact, well-separated clusters by minimizing the variance within each cluster during the merging process. In this algorithm, each data point starts as its own cluster, and the algorithm merges clusters iteratively until a stopping criterion is met. The algorithm starts with each data point as a singleton cluster, with the number of initial clusters equal to the number of data points. The pairwise distance matrix is calculated between all clusters and used as a measure of dissimilarity. For each cluster, the within-cluster variance is computed, which evaluates how tightly the data points within a cluster are grouped. The two clusters that, when merged, result in the smallest increase in the within-cluster variance are identified and merged in the next step. The within-cluster variance for the newly formed cluster is recalculated and the pairwise distance matrix updated. All previous steps are repeated until the desired number of clusters is reached or until a stopping criterion is met.
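Both hierarchical variants above differ only in the linkage criterion, which is a single parameter in scikit-learn's `AgglomerativeClustering`; the sketch below contrasts them on synthetic data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data in place of the actual descriptors
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Complete linkage: merge the pair with the smallest maximum inter-point distance
complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(X)

# Ward linkage: merge the pair yielding the smallest increase in within-cluster variance
ward = AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)

print(complete.n_clusters_, ward.n_clusters_)
```

On well-separated data the two linkages typically agree; they diverge on elongated or unevenly sized clusters, where Ward favors compact, similarly sized groups.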
The optimal combination of hyperparameter values which maximized the performance of the various clustering models in the study used the following hyperparameter tuning strategy:
K-Fold Cross-Validation involves dividing the training set, after a random shuffle, into a user-defined K number of smaller non-overlapping sets called folds. Each unique fold is assigned as the hold-out test data to assess the model trained on the data collected from all the remaining K-1 folds. The evaluation score is retained but the model is discarded. The process is performed recursively, resulting in a total of K fitted models evaluated on the K hold-out test sets. All K computed performance measures are then averaged to represent the estimated performance of the model. This approach can be computationally expensive and may be highly dependent on how the data was randomly assigned to the folds, but it does not waste too much data, which is a major advantage in problems where the number of samples is very small.
The segmentation performance of the formulated clustering models in the study was compared and evaluated using the following metrics:
Silhouette Score assesses the quality of clusters created by a clustering algorithm. It measures how well-separated the clusters are and how similar each data point in a cluster is to the other points in the same cluster compared to the nearest neighboring cluster. The silhouette score ranges from -1 to 1, where a higher value indicates better-defined clusters. The silhouette method requires the computation of a silhouette score for each data point, defined as the average dissimilarity of the data point to all points in the nearest neighboring cluster minus the average dissimilarity of the data point to the other points in its own cluster, divided by the larger of the two quantities. The overall silhouette score for the clustering is the average of the silhouette scores for all data points.
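The per-point and overall definitions above map directly onto scikit-learn's `silhouette_samples` and `silhouette_score`; the toy two-cluster data below is illustrative only.

```python
import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

# Two well-separated toy clusters with known labels
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Per-point score: s(i) = (b(i) - a(i)) / max(a(i), b(i)),
# where a(i) is the mean intra-cluster dissimilarity and
# b(i) the mean dissimilarity to the nearest neighboring cluster
per_point = silhouette_samples(X, labels)

# Overall score: the mean of the per-point scores
overall = silhouette_score(X, labels)
print(overall)
```

Because the two clusters here are far apart relative to their spread, the overall score is close to 1.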
Model presentation was conducted post-hoc to interpret the formulated clusters based on segmentation patterns and obtain insights based on their relationship and association. These methods were described as follows:
Cluster Visualization Plots are graphical representations designed to provide insight into the structure, distribution, and characteristics of clusters formed by clustering algorithms including but not limited to pair plots, heat maps and geographic plots. Pair plots (scatterplot matrices) can be used to visualize relationships between pairs of features within clusters. They provide a comprehensive view of feature distributions and correlations within and between clusters. Heatmaps are useful for visualizing the distribution of features within clusters. They provide a color-coded representation of feature values across clusters, allowing users to identify patterns and differences in the feature profiles of clusters. Geographic maps divide the space into regions based on geolocation data including latitude and longitude. Each region is color-coded to a specific cluster, providing a spatial representation of cluster boundaries.
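As a minimal sketch of the heat map idea, the snippet below color-codes cluster-wise mean values of three descriptors; the cluster assignments and descriptor values are randomly generated placeholders, not the study's results.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the figure renders without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical cluster assignments over three of the mortality descriptors
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=['PROCAN', 'BRECAN', 'LUNCAN'])
df['CLUSTER'] = rng.integers(0, 2, size=30)

# Heat map of cluster-wise mean descriptor values:
# rows are clusters, columns are descriptors, color encodes magnitude
cluster_means = df.groupby('CLUSTER').mean()
ax = sns.heatmap(cluster_means, annot=True, fmt='.2f', cmap='viridis')
plt.close('all')

print(cluster_means.shape)
```

The same `groupby`-then-plot pattern extends to pair plots (`sns.pairplot` with `hue` set to the cluster label) and to geographic plots once latitude/longitude are joined back in.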
##################################
# Installing the geopandas package
##################################
# !pip install geopandas
##################################
# Setting the Python Environment
##################################
import os
os.environ["OMP_NUM_THREADS"] = '1'
##################################
# Loading Python Libraries
##################################
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
%matplotlib inline
from operator import add,mul,truediv
from sklearn.preprocessing import PowerTransformer, StandardScaler
from scipy import stats
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, SpectralClustering, AgglomerativeClustering, Birch, BisectingKMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
import geopandas as gpd
##################################
# Setting Global Options
##################################
np.set_printoptions(suppress=True, precision=4)
pd.options.display.float_format = '{:.4f}'.format
##################################
# Loading the dataset
##################################
cancer_death_rate = pd.read_csv('CancerDeathsByCountryCode.csv')
##################################
# Performing a general exploration of the dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate.dtypes)
Column Names and Data Types:
COUNTRY     object
CODE        object
PROCAN     float64
BRECAN     float64
CERCAN     float64
STOCAN     float64
ESOCAN     float64
PANCAN     float64
LUNCAN     float64
COLCAN     float64
LIVCAN     float64
SMPREV     float64
OWPREV     float64
ACSHAR     float64
GEOLAT     float64
GEOLON     float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate.head()
| COUNTRY | CODE | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR | GEOLAT | GEOLON | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 6.3700 | 8.6700 | 3.9000 | 29.3000 | 6.9600 | 2.7200 | 12.5300 | 8.4300 | 10.2700 | 11.9000 | 23.0000 | 0.2100 | 33.9391 | 67.7100 |
| 1 | Albania | ALB | 8.8700 | 6.5000 | 1.6400 | 10.6800 | 1.4400 | 6.6800 | 26.6300 | 9.1500 | 6.8400 | 20.5000 | 57.7000 | 7.1700 | 41.1533 | 20.1683 |
| 2 | Algeria | DZA | 5.3300 | 7.5800 | 2.1800 | 5.1000 | 1.1500 | 4.2700 | 10.4600 | 8.0500 | 2.2000 | 11.2000 | 62.0000 | 0.9500 | 28.0339 | 1.6596 |
| 3 | American Samoa | ASM | 20.9400 | 16.8100 | 5.0200 | 15.7900 | 1.5200 | 5.1900 | 28.0100 | 16.5500 | 7.0200 | NaN | NaN | NaN | -14.2710 | -170.1322 |
| 4 | Andorra | AND | 9.6800 | 9.0200 | 2.0400 | 8.3000 | 3.5600 | 10.2600 | 34.1800 | 22.9700 | 9.4400 | 26.6000 | 63.7000 | 11.0200 | 42.5462 | 1.6016 |
##################################
# Performing a general exploration of the numeric variables
##################################
if (len(cancer_death_rate.select_dtypes(include='number').columns)==0):
print('No numeric columns identified from the data.')
else:
print('Numeric Variable Summary:')
display(cancer_death_rate.describe(include='number').transpose())
Numeric Variable Summary:
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| PROCAN | 208.0000 | 11.7260 | 7.6965 | 2.8100 | 6.5875 | 10.0050 | 13.9900 | 54.1500 |
| BRECAN | 208.0000 | 11.3350 | 4.3649 | 4.6900 | 8.3975 | 10.5600 | 13.0950 | 37.1000 |
| CERCAN | 208.0000 | 6.0651 | 5.1204 | 0.7100 | 1.8575 | 4.4800 | 9.0575 | 39.9500 |
| STOCAN | 208.0000 | 10.5975 | 5.8993 | 3.4000 | 6.6350 | 9.1550 | 13.6725 | 46.0400 |
| ESOCAN | 208.0000 | 4.8946 | 4.1320 | 0.9600 | 2.3350 | 3.3100 | 5.4150 | 25.7600 |
| PANCAN | 208.0000 | 6.6004 | 3.0552 | 1.6000 | 4.2300 | 6.1150 | 8.7450 | 19.2900 |
| LUNCAN | 208.0000 | 21.0217 | 11.4489 | 5.9500 | 11.3800 | 20.0200 | 27.5125 | 78.2300 |
| COLCAN | 208.0000 | 13.6945 | 5.5475 | 4.9400 | 9.2775 | 12.7950 | 17.1325 | 31.3800 |
| LIVCAN | 208.0000 | 5.9826 | 9.0501 | 0.6500 | 2.8400 | 3.8950 | 6.0750 | 115.2300 |
| SMPREV | 186.0000 | 17.0140 | 8.0416 | 3.3000 | 10.4250 | 16.4000 | 22.8500 | 41.1000 |
| OWPREV | 191.0000 | 48.9963 | 17.0164 | 18.3000 | 31.2500 | 55.0000 | 60.9000 | 88.5000 |
| ACSHAR | 187.0000 | 6.0013 | 4.1502 | 0.0030 | 2.2750 | 5.7000 | 9.2500 | 20.5000 |
| GEOLAT | 208.0000 | 19.0381 | 24.3776 | -40.9006 | 4.1377 | 17.3443 | 40.0876 | 71.7069 |
| GEOLON | 208.0000 | 16.2690 | 71.9576 | -175.1982 | -11.1506 | 19.4388 | 47.8118 | 179.4144 |
##################################
# Performing a general exploration of the object variable
##################################
if (len(cancer_death_rate.select_dtypes(include='object').columns)==0):
print('No object columns identified from the data.')
else:
print('Object Variable Summary:')
display(cancer_death_rate.describe(include='object').transpose())
Object Variable Summary:
| count | unique | top | freq | |
|---|---|---|---|---|
| COUNTRY | 208 | 208 | Afghanistan | 1 |
| CODE | 203 | 203 | AFG | 1 |
##################################
# Performing a general exploration of the categorical variables
##################################
if (len(cancer_death_rate.select_dtypes(include='category').columns)==0):
print('No categorical columns identified from the data.')
else:
print('Categorical Variable Summary:')
    display(cancer_death_rate.describe(include='category').transpose())
No categorical columns identified from the data.
Data quality findings based on assessment are as follows:
##################################
# Counting the number of duplicated rows
##################################
cancer_death_rate.duplicated().sum()
0
##################################
# Gathering the data types for each column
##################################
data_type_list = list(cancer_death_rate.dtypes)
##################################
# Gathering the variable names for each column
##################################
variable_name_list = list(cancer_death_rate.columns)
##################################
# Gathering the number of observations for each column
##################################
row_count_list = list([len(cancer_death_rate)] * len(cancer_death_rate.columns))
##################################
# Gathering the number of missing data for each column
##################################
null_count_list = list(cancer_death_rate.isna().sum(axis=0))
##################################
# Gathering the number of non-missing data for each column
##################################
non_null_count_list = list(cancer_death_rate.count())
##################################
# Gathering the fill rate for each column
##################################
fill_rate_list = map(truediv, non_null_count_list, row_count_list)
##################################
# Formulating the summary
# for all columns
##################################
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
data_type_list,
row_count_list,
non_null_count_list,
null_count_list,
fill_rate_list),
columns=['Column.Name',
'Column.Type',
'Row.Count',
'Non.Null.Count',
'Null.Count',
'Fill.Rate'])
display(all_column_quality_summary)
| Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
|---|---|---|---|---|---|---|
| 0 | COUNTRY | object | 208 | 208 | 0 | 1.0000 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
| 2 | PROCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 3 | BRECAN | float64 | 208 | 208 | 0 | 1.0000 |
| 4 | CERCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 5 | STOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 6 | ESOCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 7 | PANCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 8 | LUNCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 9 | COLCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 10 | LIVCAN | float64 | 208 | 208 | 0 | 1.0000 |
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 14 | GEOLAT | float64 | 208 | 208 | 0 | 1.0000 |
| 15 | GEOLON | float64 | 208 | 208 | 0 | 1.0000 |
##################################
# Counting the number of columns
# with Fill.Rate < 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])
4
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
if (len(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)])==0):
print('No columns with Fill.Rate < 1.00.')
else:
display(all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1)].sort_values(by=['Fill.Rate'], ascending=True))
| Column.Name | Column.Type | Row.Count | Non.Null.Count | Null.Count | Fill.Rate | |
|---|---|---|---|---|---|---|
| 11 | SMPREV | float64 | 208 | 186 | 22 | 0.8942 |
| 13 | ACSHAR | float64 | 208 | 187 | 21 | 0.8990 |
| 12 | OWPREV | float64 | 208 | 191 | 17 | 0.9183 |
| 1 | CODE | object | 208 | 203 | 5 | 0.9760 |
##################################
# Identifying the columns
# with Fill.Rate < 1.00
##################################
column_low_fill_rate = all_column_quality_summary[(all_column_quality_summary['Fill.Rate']<1.00)]
##################################
# Gathering the metadata labels for each observation
##################################
row_metadata_list = cancer_death_rate["COUNTRY"].values.tolist()
##################################
# Gathering the number of columns for each observation
##################################
column_count_list = list([len(cancer_death_rate.columns)] * len(cancer_death_rate))
##################################
# Gathering the number of missing data for each row
##################################
null_row_list = list(cancer_death_rate.isna().sum(axis=1))
##################################
# Gathering the missing data percentage for each row
##################################
missing_rate_list = map(truediv, null_row_list, column_count_list)
##################################
# Identifying the rows
# with missing data
##################################
all_row_quality_summary = pd.DataFrame(zip(row_metadata_list,
column_count_list,
null_row_list,
missing_rate_list),
columns=['Row.Name',
'Column.Count',
'Null.Count',
'Missing.Rate'])
display(all_row_quality_summary)
| Row.Name | Column.Count | Null.Count | Missing.Rate | |
|---|---|---|---|---|
| 0 | Afghanistan | 16 | 0 | 0.0000 |
| 1 | Albania | 16 | 0 | 0.0000 |
| 2 | Algeria | 16 | 0 | 0.0000 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 4 | Andorra | 16 | 0 | 0.0000 |
| ... | ... | ... | ... | ... |
| 203 | Vietnam | 16 | 0 | 0.0000 |
| 204 | Wales | 16 | 4 | 0.2500 |
| 205 | Yemen | 16 | 0 | 0.0000 |
| 206 | Zambia | 16 | 0 | 0.0000 |
| 207 | Zimbabwe | 16 | 0 | 0.0000 |
208 rows × 4 columns
##################################
# Counting the number of rows
# with Missing.Rate > 0.00
##################################
len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])
25
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
row_missing_rate = all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)]
##################################
# Identifying the rows
# with Missing.Rate > 0.00
##################################
if (len(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)])==0):
print('No rows with Missing.Rate > 0.00.')
else:
display(all_row_quality_summary[(all_row_quality_summary['Missing.Rate']>0.00)].sort_values(by=['Missing.Rate'], ascending=False))
| Row.Name | Column.Count | Null.Count | Missing.Rate | |
|---|---|---|---|---|
| 204 | Wales | 16 | 4 | 0.2500 |
| 135 | Northern Ireland | 16 | 4 | 0.2500 |
| 57 | England | 16 | 4 | 0.2500 |
| 186 | Tokelau | 16 | 4 | 0.2500 |
| 161 | Scotland | 16 | 4 | 0.2500 |
| 198 | United States Virgin Islands | 16 | 3 | 0.1875 |
| 173 | South Sudan | 16 | 3 | 0.1875 |
| 158 | San Marino | 16 | 3 | 0.1875 |
| 149 | Puerto Rico | 16 | 3 | 0.1875 |
| 20 | Bermuda | 16 | 3 | 0.1875 |
| 3 | American Samoa | 16 | 3 | 0.1875 |
| 118 | Monaco | 16 | 3 | 0.1875 |
| 74 | Guam | 16 | 3 | 0.1875 |
| 72 | Greenland | 16 | 3 | 0.1875 |
| 136 | Northern Mariana Islands | 16 | 3 | 0.1875 |
| 132 | Niue | 16 | 2 | 0.1250 |
| 140 | Palau | 16 | 2 | 0.1250 |
| 141 | Palestine | 16 | 2 | 0.1250 |
| 181 | Taiwan | 16 | 2 | 0.1250 |
| 41 | Cook Islands | 16 | 2 | 0.1250 |
| 125 | Nauru | 16 | 1 | 0.0625 |
| 154 | Saint Kitts and Nevis | 16 | 1 | 0.0625 |
| 116 | Micronesia | 16 | 1 | 0.0625 |
| 112 | Marshall Islands | 16 | 1 | 0.0625 |
| 192 | Tuvalu | 16 | 1 | 0.0625 |
##################################
# Formulating the dataset
# with numeric columns only
##################################
cancer_death_rate_numeric = cancer_death_rate.select_dtypes(include='number')
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = cancer_death_rate_numeric.columns
##################################
# Gathering the minimum value for each numeric column
##################################
numeric_minimum_list = cancer_death_rate_numeric.min()
##################################
# Gathering the mean value for each numeric column
##################################
numeric_mean_list = cancer_death_rate_numeric.mean()
##################################
# Gathering the median value for each numeric column
##################################
numeric_median_list = cancer_death_rate_numeric.median()
##################################
# Gathering the maximum value for each numeric column
##################################
numeric_maximum_list = cancer_death_rate_numeric.max()
##################################
# Gathering the first mode values for each numeric column
##################################
numeric_first_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0] for x in cancer_death_rate_numeric]
##################################
# Gathering the second mode values for each numeric column
##################################
numeric_second_mode_list = [cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1] for x in cancer_death_rate_numeric]
##################################
# Gathering the count of first mode values for each numeric column
##################################
numeric_first_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the count of second mode values for each numeric column
##################################
numeric_second_mode_count_list = [cancer_death_rate_numeric[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_numeric]
##################################
# Gathering the first mode to second mode ratio for each numeric column
##################################
numeric_first_second_mode_ratio_list = map(truediv, numeric_first_mode_count_list, numeric_second_mode_count_list)
##################################
# Gathering the count of unique values for each numeric column
##################################
numeric_unique_count_list = cancer_death_rate_numeric.nunique(dropna=True)
##################################
# Gathering the number of observations for each numeric column
##################################
numeric_row_count_list = list([len(cancer_death_rate_numeric)] * len(cancer_death_rate_numeric.columns))
##################################
# Gathering the unique to count ratio for each numeric column
##################################
numeric_unique_count_ratio_list = map(truediv, numeric_unique_count_list, numeric_row_count_list)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_numeric.skew()
##################################
# Gathering the kurtosis value for each numeric column
##################################
numeric_kurtosis_list = cancer_death_rate_numeric.kurtosis()
##################################
# Formulating the summary
# for all numeric columns
##################################
numeric_column_quality_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_minimum_list,
numeric_mean_list,
numeric_median_list,
numeric_maximum_list,
numeric_first_mode_list,
numeric_second_mode_list,
numeric_first_mode_count_list,
numeric_second_mode_count_list,
numeric_first_second_mode_ratio_list,
numeric_unique_count_list,
numeric_row_count_list,
numeric_unique_count_ratio_list,
numeric_skewness_list,
numeric_kurtosis_list),
columns=['Numeric.Column.Name',
'Minimum',
'Mean',
'Median',
'Maximum',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio',
'Skewness',
'Kurtosis'])
if (len(cancer_death_rate_numeric.columns)==0):
print('No numeric columns identified from the data.')
else:
display(numeric_column_quality_summary)
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PROCAN | 2.8100 | 11.7260 | 10.0050 | 54.1500 | 15.4100 | 9.2300 | 2 | 2 | 1.0000 | 198 | 208 | 0.9519 | 2.1250 | 6.1837 |
| 1 | BRECAN | 4.6900 | 11.3350 | 10.5600 | 37.1000 | 10.2900 | 8.9900 | 3 | 2 | 1.5000 | 190 | 208 | 0.9135 | 1.5844 | 5.4634 |
| 2 | CERCAN | 0.7100 | 6.0651 | 4.4800 | 39.9500 | 4.6200 | 1.5200 | 3 | 3 | 1.0000 | 189 | 208 | 0.9087 | 1.9715 | 8.3399 |
| 3 | STOCAN | 3.4000 | 10.5975 | 9.1550 | 46.0400 | 7.0200 | 6.5800 | 2 | 2 | 1.0000 | 196 | 208 | 0.9423 | 2.0526 | 7.3909 |
| 4 | ESOCAN | 0.9600 | 4.8946 | 3.3100 | 25.7600 | 2.5200 | 1.6800 | 3 | 3 | 1.0000 | 180 | 208 | 0.8654 | 2.0659 | 5.2990 |
| 5 | PANCAN | 1.6000 | 6.6004 | 6.1150 | 19.2900 | 3.1300 | 3.0700 | 3 | 2 | 1.5000 | 187 | 208 | 0.8990 | 0.9127 | 1.5264 |
| 6 | LUNCAN | 5.9500 | 21.0217 | 20.0200 | 78.2300 | 10.7500 | 11.6200 | 3 | 2 | 1.5000 | 200 | 208 | 0.9615 | 1.2646 | 2.8631 |
| 7 | COLCAN | 4.9400 | 13.6945 | 12.7950 | 31.3800 | 10.9000 | 12.2900 | 2 | 2 | 1.0000 | 199 | 208 | 0.9567 | 0.7739 | 0.1459 |
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
| 9 | SMPREV | 3.3000 | 17.0140 | 16.4000 | 41.1000 | 22.4000 | 26.5000 | 4 | 4 | 1.0000 | 141 | 208 | 0.6779 | 0.4096 | -0.4815 |
| 10 | OWPREV | 18.3000 | 48.9963 | 55.0000 | 88.5000 | 61.6000 | 28.4000 | 5 | 3 | 1.6667 | 157 | 208 | 0.7548 | -0.1617 | -0.9762 |
| 11 | ACSHAR | 0.0030 | 6.0013 | 5.7000 | 20.5000 | 0.6900 | 12.0300 | 3 | 2 | 1.5000 | 177 | 208 | 0.8510 | 0.3532 | -0.5657 |
| 12 | GEOLAT | -40.9006 | 19.0381 | 17.3443 | 71.7069 | 55.3781 | 53.4129 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.1861 | -0.6520 |
| 13 | GEOLON | -175.1982 | 16.2690 | 19.4388 | 179.4144 | -3.4360 | -8.2439 | 2 | 2 | 1.0000 | 206 | 208 | 0.9904 | -0.2025 | 0.3981 |
##################################
# Counting the number of numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Identifying the numeric columns
# with First.Second.Mode.Ratio > 5.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)])==0):
print('No numeric columns with First.Second.Mode.Ratio > 5.00.')
else:
display(numeric_column_quality_summary[(numeric_column_quality_summary['First.Second.Mode.Ratio']>5)].sort_values(by=['First.Second.Mode.Ratio'], ascending=False))
No numeric columns with First.Second.Mode.Ratio > 5.00.
##################################
# Counting the number of numeric columns
# with Unique.Count.Ratio > 10.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Counting the number of numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3)|(numeric_column_quality_summary['Skewness']<(-3))])
1
##################################
# Identifying the numeric columns
# with Skewness > 3.00 or Skewness < -3.00
##################################
if (len(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))])==0):
print('No numeric columns with Skewness > 3.00 or Skewness < -3.00.')
else:
display(numeric_column_quality_summary[(numeric_column_quality_summary['Skewness']>3) | (numeric_column_quality_summary['Skewness']<(-3))].sort_values(by=['Skewness'], ascending=False))
| | Numeric.Column.Name | Minimum | Mean | Median | Maximum | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | LIVCAN | 0.6500 | 5.9826 | 3.8950 | 115.2300 | 2.7500 | 2.7400 | 6 | 4 | 1.5000 | 173 | 208 | 0.8317 | 9.1131 | 104.2327 |
##################################
# Formulating the dataset
# with object column only
##################################
cancer_death_rate_object = cancer_death_rate.select_dtypes(include='object')
##################################
# Gathering the variable names for the object column
##################################
object_variable_name_list = cancer_death_rate_object.columns
##################################
# Gathering the first mode values for the object column
##################################
object_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_object]
##################################
# Gathering the second mode values for each object column
##################################
object_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_object]
##################################
# Gathering the count of first mode values for each object column
##################################
object_first_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the count of second mode values for each object column
##################################
object_second_mode_count_list = [cancer_death_rate_object[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_object]
##################################
# Gathering the first mode to second mode ratio for each object column
##################################
object_first_second_mode_ratio_list = map(truediv, object_first_mode_count_list, object_second_mode_count_list)
##################################
# Gathering the count of unique values for each object column
##################################
object_unique_count_list = cancer_death_rate_object.nunique(dropna=True)
##################################
# Gathering the number of observations for each object column
##################################
object_row_count_list = list([len(cancer_death_rate_object)] * len(cancer_death_rate_object.columns))
##################################
# Gathering the unique to count ratio for each object column
##################################
object_unique_count_ratio_list = map(truediv, object_unique_count_list, object_row_count_list)
object_column_quality_summary = pd.DataFrame(zip(object_variable_name_list,
object_first_mode_list,
object_second_mode_list,
object_first_mode_count_list,
object_second_mode_count_list,
object_first_second_mode_ratio_list,
object_unique_count_list,
object_row_count_list,
object_unique_count_ratio_list),
columns=['Object.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_object.columns)==0):
print('No object columns identified from the data.')
else:
display(object_column_quality_summary)
| | Object.Column.Name | First.Mode | Second.Mode | First.Mode.Count | Second.Mode.Count | First.Second.Mode.Ratio | Unique.Count | Row.Count | Unique.Count.Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | COUNTRY | Afghanistan | Albania | 1 | 1 | 1.0000 | 208 | 208 | 1.0000 |
| 1 | CODE | AFG | PSX | 1 | 1 | 1.0000 | 203 | 208 | 0.9760 |
##################################
# Counting the number of object columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of object columns
# with Unique.Count.Ratio > 10.00
##################################
len(object_column_quality_summary[(object_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Formulating the dataset
# with categorical columns only
##################################
cancer_death_rate_categorical = cancer_death_rate.select_dtypes(include='category')
##################################
# Gathering the variable names for the categorical column
##################################
categorical_variable_name_list = cancer_death_rate_categorical.columns
##################################
# Gathering the first mode values for each categorical column
##################################
categorical_first_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[0] for x in cancer_death_rate_categorical]
##################################
# Gathering the second mode values for each categorical column
##################################
categorical_second_mode_list = [cancer_death_rate[x].value_counts().index.tolist()[1] for x in cancer_death_rate_categorical]
##################################
# Gathering the count of first mode values for each categorical column
##################################
categorical_first_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[0]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the count of second mode values for each categorical column
##################################
categorical_second_mode_count_list = [cancer_death_rate_categorical[x].isin([cancer_death_rate[x].value_counts(dropna=True).index.tolist()[1]]).sum() for x in cancer_death_rate_categorical]
##################################
# Gathering the first mode to second mode ratio for each categorical column
##################################
categorical_first_second_mode_ratio_list = map(truediv, categorical_first_mode_count_list, categorical_second_mode_count_list)
##################################
# Gathering the count of unique values for each categorical column
##################################
categorical_unique_count_list = cancer_death_rate_categorical.nunique(dropna=True)
##################################
# Gathering the number of observations for each categorical column
##################################
categorical_row_count_list = list([len(cancer_death_rate_categorical)] * len(cancer_death_rate_categorical.columns))
##################################
# Gathering the unique to count ratio for each categorical column
##################################
categorical_unique_count_ratio_list = map(truediv, categorical_unique_count_list, categorical_row_count_list)
categorical_column_quality_summary = pd.DataFrame(zip(categorical_variable_name_list,
categorical_first_mode_list,
categorical_second_mode_list,
categorical_first_mode_count_list,
categorical_second_mode_count_list,
categorical_first_second_mode_ratio_list,
categorical_unique_count_list,
categorical_row_count_list,
categorical_unique_count_ratio_list),
columns=['Categorical.Column.Name',
'First.Mode',
'Second.Mode',
'First.Mode.Count',
'Second.Mode.Count',
'First.Second.Mode.Ratio',
'Unique.Count',
'Row.Count',
'Unique.Count.Ratio'])
if (len(cancer_death_rate_categorical.columns)==0):
print('No categorical columns identified from the data.')
else:
display(categorical_column_quality_summary)
No categorical columns identified from the data.
##################################
# Counting the number of categorical columns
# with First.Second.Mode.Ratio > 5.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['First.Second.Mode.Ratio']>5)])
0
##################################
# Counting the number of categorical columns
# with Unique.Count.Ratio > 10.00
##################################
len(categorical_column_quality_summary[(categorical_column_quality_summary['Unique.Count.Ratio']>10)])
0
##################################
# Performing a general exploration of the original dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate.shape)
Dataset Dimensions:
(208, 16)
##################################
# Filtering out the rows with
# with Missing.Rate > 0.00
##################################
cancer_death_rate_filtered_row = cancer_death_rate.drop(cancer_death_rate[cancer_death_rate.COUNTRY.isin(row_missing_rate['Row.Name'].values.tolist())].index)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_filtered_row.shape)
Dataset Dimensions:
(183, 16)
##################################
# Re-evaluating the missing data summary
# for the filtered data
##################################
variable_name_list = list(cancer_death_rate_filtered_row.columns)
null_count_list = list(cancer_death_rate_filtered_row.isna().sum(axis=0))
all_column_quality_summary = pd.DataFrame(zip(variable_name_list,
null_count_list),
columns=['Column.Name',
'Null.Count'])
display(all_column_quality_summary)
| | Column.Name | Null.Count |
|---|---|---|
| 0 | COUNTRY | 0 |
| 1 | CODE | 0 |
| 2 | PROCAN | 0 |
| 3 | BRECAN | 0 |
| 4 | CERCAN | 0 |
| 5 | STOCAN | 0 |
| 6 | ESOCAN | 0 |
| 7 | PANCAN | 0 |
| 8 | LUNCAN | 0 |
| 9 | COLCAN | 0 |
| 10 | LIVCAN | 0 |
| 11 | SMPREV | 0 |
| 12 | OWPREV | 0 |
| 13 | ACSHAR | 0 |
| 14 | GEOLAT | 0 |
| 15 | GEOLON | 0 |
##################################
# Identifying the columns
# with Null.Count > 1.00
##################################
len(all_column_quality_summary[(all_column_quality_summary['Null.Count']>1.00)])
0
##################################
# Formulating a new dataset object
# for the cleaned data
##################################
cancer_death_rate_cleaned = cancer_death_rate_filtered_row.copy()
cancer_death_rate_cleaned.reset_index(drop=True, inplace=True)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned.shape)
Dataset Dimensions:
(183, 16)
##################################
# Formulating the cleaned dataset
# with geolocation data
##################################
cancer_death_rate_cleaned_numeric = cancer_death_rate_cleaned.select_dtypes(include='number').copy()
cancer_death_rate_cleaned_numeric_geolocation = cancer_death_rate_cleaned_numeric[['GEOLAT','GEOLON']]
##################################
# Formulating the cleaned dataset
# with numeric columns only
# without the geolocation data
##################################
cancer_death_rate_cleaned_numeric.drop(['GEOLAT','GEOLON'], inplace=True, axis=1)
##################################
# Gathering the variable names for each numeric column
##################################
numeric_variable_name_list = list(cancer_death_rate_cleaned_numeric.columns)
##################################
# Gathering the skewness value for each numeric column
##################################
numeric_skewness_list = cancer_death_rate_cleaned_numeric.skew()
##################################
# Computing the interquartile range
# for all columns
##################################
cancer_death_rate_cleaned_numeric_q1 = cancer_death_rate_cleaned_numeric.quantile(0.25)
cancer_death_rate_cleaned_numeric_q3 = cancer_death_rate_cleaned_numeric.quantile(0.75)
cancer_death_rate_cleaned_numeric_iqr = cancer_death_rate_cleaned_numeric_q3 - cancer_death_rate_cleaned_numeric_q1
##################################
# Gathering the outlier count for each numeric column
# based on the interquartile range criterion
##################################
numeric_outlier_count_list = ((cancer_death_rate_cleaned_numeric < (cancer_death_rate_cleaned_numeric_q1 - 1.5 * cancer_death_rate_cleaned_numeric_iqr)) | (cancer_death_rate_cleaned_numeric > (cancer_death_rate_cleaned_numeric_q3 + 1.5 * cancer_death_rate_cleaned_numeric_iqr))).sum()
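The 1.5 × IQR fence applied above can be sanity-checked on a toy series (illustrative values only): any point below Q1 − 1.5·IQR or above Q3 + 1.5·IQR counts as an outlier.

```python
import pandas as pd

# Toy column: six clustered values plus one extreme point
values = pd.Series([10, 11, 12, 13, 14, 15, 100])
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Flag points outside the Tukey fences, as in the outlier count above
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(int(is_outlier.sum()))  # 1 — only the extreme value 100 is flagged
```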
##################################
# Gathering the number of observations for each column
##################################
numeric_row_count_list = list([len(cancer_death_rate_cleaned_numeric)] * len(cancer_death_rate_cleaned_numeric.columns))
##################################
# Gathering the outlier to row count ratio for each numeric column
##################################
numeric_outlier_ratio_list = map(truediv, numeric_outlier_count_list, numeric_row_count_list)
##################################
# Formulating the outlier summary
# for all numeric columns
##################################
numeric_column_outlier_summary = pd.DataFrame(zip(numeric_variable_name_list,
numeric_skewness_list,
numeric_outlier_count_list,
numeric_row_count_list,
numeric_outlier_ratio_list),
columns=['Numeric.Column.Name',
'Skewness',
'Outlier.Count',
'Row.Count',
'Outlier.Ratio'])
display(numeric_column_outlier_summary)
| | Numeric.Column.Name | Skewness | Outlier.Count | Row.Count | Outlier.Ratio |
|---|---|---|---|---|---|
| 0 | PROCAN | 2.2461 | 11 | 183 | 0.0601 |
| 1 | BRECAN | 1.9575 | 8 | 183 | 0.0437 |
| 2 | CERCAN | 1.9896 | 2 | 183 | 0.0109 |
| 3 | STOCAN | 2.0858 | 6 | 183 | 0.0328 |
| 4 | ESOCAN | 2.0918 | 24 | 183 | 0.1311 |
| 5 | PANCAN | 0.5992 | 1 | 183 | 0.0055 |
| 6 | LUNCAN | 0.8574 | 2 | 183 | 0.0109 |
| 7 | COLCAN | 0.8201 | 2 | 183 | 0.0109 |
| 8 | LIVCAN | 8.7158 | 19 | 183 | 0.1038 |
| 9 | SMPREV | 0.4165 | 0 | 183 | 0.0000 |
| 10 | OWPREV | -0.3341 | 0 | 183 | 0.0000 |
| 11 | ACSHAR | 0.3372 | 1 | 183 | 0.0055 |
##################################
# Formulating the individual boxplots
# for all numeric columns
##################################
for column in cancer_death_rate_cleaned_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_cleaned_numeric, x=column)
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
def plot_correlation_matrix(corr, mask=None):
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(corr,
ax=ax,
mask=mask,
annot=True,
vmin=-1,
vmax=1,
center=0,
cmap='coolwarm',
linewidths=1,
linecolor='gray',
cbar_kws={'orientation': 'horizontal'})
##################################
# Computing the correlation coefficients
# and correlation p-values
# among pairs of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation_pairs = {}
cancer_death_rate_cleaned_numeric_columns = cancer_death_rate_cleaned_numeric.columns.tolist()
for numeric_column_a, numeric_column_b in itertools.combinations(cancer_death_rate_cleaned_numeric_columns, 2):
cancer_death_rate_cleaned_numeric_correlation_pairs[numeric_column_a + '_' + numeric_column_b] = stats.pearsonr(
cancer_death_rate_cleaned_numeric.loc[:, numeric_column_a],
cancer_death_rate_cleaned_numeric.loc[:, numeric_column_b])
##################################
# Formulating the pairwise correlation summary
# for all numeric columns
##################################
cancer_death_rate_cleaned_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_cleaned_numeric_correlation_pairs, orient='index')
cancer_death_rate_cleaned_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_cleaned_numeric_summary.sort_values(by=['Pearson.Correlation.Coefficient'], ascending=False).head(20))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| PANCAN_COLCAN | 0.7537 | 0.0000 |
| LUNCAN_COLCAN | 0.7010 | 0.0000 |
| LUNCAN_SMPREV | 0.6415 | 0.0000 |
| PANCAN_LUNCAN | 0.6367 | 0.0000 |
| COLCAN_ACSHAR | 0.5819 | 0.0000 |
| PANCAN_ACSHAR | 0.5750 | 0.0000 |
| PANCAN_OWPREV | 0.5212 | 0.0000 |
| CERCAN_ESOCAN | 0.4803 | 0.0000 |
| LUNCAN_ACSHAR | 0.4330 | 0.0000 |
| STOCAN_LIVCAN | 0.4291 | 0.0000 |
| SMPREV_OWPREV | 0.4164 | 0.0000 |
| COLCAN_SMPREV | 0.4126 | 0.0000 |
| COLCAN_OWPREV | 0.4102 | 0.0000 |
| LUNCAN_OWPREV | 0.4087 | 0.0000 |
| PROCAN_BRECAN | 0.4081 | 0.0000 |
| PROCAN_CERCAN | 0.3650 | 0.0000 |
| PANCAN_SMPREV | 0.3603 | 0.0000 |
| BRECAN_CERCAN | 0.3589 | 0.0000 |
| ESOCAN_LIVCAN | 0.3009 | 0.0000 |
| CERCAN_STOCAN | 0.2790 | 0.0001 |
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
##################################
cancer_death_rate_cleaned_numeric_correlation = cancer_death_rate_cleaned_numeric.corr()
mask = np.triu(np.ones_like(cancer_death_rate_cleaned_numeric_correlation, dtype=bool))
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
plt.show()
##################################
# Formulating a function
# to plot the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
def correlation_significance(df=None):
p_matrix = np.zeros(shape=(df.shape[1],df.shape[1]))
for col in df.columns:
for col2 in df.drop(col,axis=1).columns:
_ , p = stats.pearsonr(df[col],df[col2])
p_matrix[df.columns.to_list().index(col),df.columns.to_list().index(col2)] = p
return p_matrix
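The p-values populated by this helper come from `stats.pearsonr`; the sketch below (synthetic data, not from the study dataset) confirms that a strongly linear pair produces a p-value well under the 0.05 cutoff used in the significance mask:

```python
import numpy as np
from scipy import stats

# Synthetic pair: y is a noisy linear function of x, so the correlation is strong
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

r, p = stats.pearsonr(x, y)
print(r > 0.9, p < 0.05)  # True True — this pair would survive the p < 0.05 mask
```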
##################################
# Plotting the correlation matrix
# for all pairwise combinations
# of numeric columns
# with significant p-values only
##################################
cancer_death_rate_cleaned_numeric_correlation_p_values = correlation_significance(cancer_death_rate_cleaned_numeric)
mask = np.invert(np.tril(cancer_death_rate_cleaned_numeric_correlation_p_values<0.05))
plot_correlation_matrix(cancer_death_rate_cleaned_numeric_correlation,mask)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_cleaned_numeric.shape)
Dataset Dimensions:
(183, 12)
##################################
# Conducting a Yeo-Johnson Transformation
# to address the distributional
# shape of the variables
##################################
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson',
standardize=False)
cancer_death_rate_cleaned_numeric_array = yeo_johnson_transformer.fit_transform(cancer_death_rate_cleaned_numeric)
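The effect of the Yeo-Johnson step can be illustrated on a synthetic right-skewed sample (illustrative data only): the fitted power transformation pulls the skewness toward zero, which is the distributional correction intended above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed sample (exponential draws have skewness near 2)
rng = np.random.default_rng(0)
skewed = pd.DataFrame({'x': rng.exponential(scale=2.0, size=500)})

transformer = PowerTransformer(method='yeo-johnson', standardize=False)
symmetric = pd.DataFrame(transformer.fit_transform(skewed), columns=['x'])

# The transformed skewness is much closer to zero than the original
print(round(skewed['x'].skew(), 2), round(symmetric['x'].skew(), 2))
```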
##################################
# Formulating a new dataset object
# for the transformed data
##################################
cancer_death_rate_transformed_numeric = pd.DataFrame(cancer_death_rate_cleaned_numeric_array,
columns=cancer_death_rate_cleaned_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_death_rate_transformed_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_transformed_numeric, x=column)
##################################
# Performing a general exploration of the filtered dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_transformed_numeric.shape)
Dataset Dimensions:
(183, 12)
cancer_death_rate_transformed_numeric
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.5595 | 1.5203 | 1.4836 | 2.1487 | 1.1939 | 1.5456 | 2.4417 | 1.9846 | 1.1035 | 4.7108 | 46.3969 | 0.2004 |
| 1 | 1.7272 | 1.4076 | 0.9307 | 1.7470 | 0.6928 | 2.6330 | 3.0570 | 2.0417 | 1.0378 | 6.4628 | 149.2448 | 3.8224 |
| 2 | 1.4670 | 1.4686 | 1.1002 | 1.4007 | 0.6154 | 2.0444 | 2.2954 | 1.9526 | 0.7708 | 4.5421 | 163.6112 | 0.7992 |
| 3 | 1.7704 | 1.5352 | 1.0595 | 1.6330 | 1.0009 | 3.2883 | 3.2603 | 2.6740 | 1.0911 | 7.4729 | 169.3720 | 5.1050 |
| 4 | 1.9120 | 1.6229 | 2.2285 | 1.6682 | 1.2514 | 1.7808 | 2.5816 | 2.0787 | 0.8161 | 3.8904 | 58.1402 | 3.7374 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 178 | 1.9905 | 1.5533 | 1.9988 | 1.8183 | 0.8160 | 2.4509 | 2.8083 | 2.1890 | 0.7997 | 5.7294 | 168.3522 | 2.5885 |
| 179 | 1.2905 | 1.6500 | 1.6597 | 1.7169 | 0.9307 | 2.1167 | 3.0677 | 2.4898 | 0.8322 | 6.4983 | 34.8080 | 4.3473 |
| 180 | 1.5002 | 1.4362 | 1.0565 | 2.0113 | 1.0697 | 1.3550 | 2.3069 | 1.8210 | 0.8885 | 5.2531 | 120.5007 | 0.0504 |
| 181 | 1.9556 | 1.6259 | 2.3874 | 1.6536 | 1.3591 | 2.2641 | 2.3685 | 2.2649 | 0.8543 | 4.5909 | 58.9437 | 3.5868 |
| 182 | 2.1649 | 1.7314 | 2.5992 | 1.8555 | 1.3729 | 2.9086 | 2.5914 | 2.2817 | 1.1442 | 4.5666 | 88.2092 | 2.8255 |
183 rows × 12 columns
##################################
# Conducting standardization
# to transform the values of the
# variables into comparable scale
##################################
standardization_scaler = StandardScaler()
cancer_death_rate_transformed_numeric_array = standardization_scaler.fit_transform(cancer_death_rate_transformed_numeric)
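As a minimal check of what `StandardScaler` does (toy matrix, illustrative only): each column ends up with zero mean and unit variance, where the variance is computed with the population standard deviation (ddof = 0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two columns on very different scales
data = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
scaled = StandardScaler().fit_transform(data)

print(np.allclose(scaled.mean(axis=0), 0.0))  # True
print(np.allclose(scaled.std(axis=0), 1.0))   # True (population std, ddof=0)
```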
##################################
# Formulating a new dataset object
# for the scaled data
##################################
cancer_death_rate_scaled_numeric = pd.DataFrame(cancer_death_rate_transformed_numeric_array,
columns=cancer_death_rate_transformed_numeric.columns)
##################################
# Formulating the individual boxplots
# for all transformed numeric columns
##################################
for column in cancer_death_rate_scaled_numeric:
plt.figure(figsize=(17,1))
sns.boxplot(data=cancer_death_rate_scaled_numeric, x=column)
##################################
# Consolidating both numeric columns
# and geolocation data
##################################
cancer_death_rate_preprocessed = pd.concat([cancer_death_rate_scaled_numeric,cancer_death_rate_cleaned_numeric_geolocation], axis=1, join='inner')
##################################
# Performing a general exploration of the consolidated dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_preprocessed.shape)
Dataset Dimensions:
(183, 14)
##################################
# Segregating the target
# and descriptor variable lists
##################################
cancer_death_rate_preprocessed_target_SMPREV = ['SMPREV']
cancer_death_rate_preprocessed_target_OWPREV = ['OWPREV']
cancer_death_rate_preprocessed_target_ACSHAR = ['ACSHAR']
cancer_death_rate_preprocessed_descriptors = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1).columns
##################################
# Segregating the target using SMPREV
# and descriptor variable names
##################################
y_variable = 'SMPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
plt.figure(figsize=[15,15])
for i, x_variable in enumerate(x_variables):
plt.subplot(num_rows,num_cols,i+1)
sns.regplot(data=cancer_death_rate_preprocessed,x=x_variable,y=y_variable, line_kws={"color":'red'})
plt.show()
##################################
# Segregating the target using OWPREV
# and descriptor variable names
##################################
y_variable = 'OWPREV'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
plt.figure(figsize=[15,15])
for i, x_variable in enumerate(x_variables):
plt.subplot(num_rows,num_cols,i+1)
sns.regplot(data=cancer_death_rate_preprocessed,x=x_variable,y=y_variable, line_kws={"color":'red'})
plt.show()
##################################
# Segregating the target using ACSHAR
# and descriptor variable names
##################################
y_variable = 'ACSHAR'
x_variables = cancer_death_rate_preprocessed_descriptors
##################################
# Defining the number of
# rows and columns for the subplots
##################################
num_rows = 3
num_cols = 3
##################################
# Formulating the individual scatterplots
# for all scaled numeric columns
##################################
plt.figure(figsize=[15,15])
for i, x_variable in enumerate(x_variables):
plt.subplot(num_rows,num_cols,i+1)
sns.regplot(data=cancer_death_rate_preprocessed,x=x_variable,y=y_variable, line_kws={"color":'red'})
plt.show()
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the SMPREV target variable
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['OWPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['SMPREV_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'SMPREV'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| SMPREV_SMPREV | 1.0000 | 0.0000 |
| SMPREV_LUNCAN | 0.6538 | 0.0000 |
| SMPREV_CERCAN | -0.4866 | 0.0000 |
| SMPREV_PROCAN | -0.4232 | 0.0000 |
| SMPREV_COLCAN | 0.4198 | 0.0000 |
| SMPREV_PANCAN | 0.3604 | 0.0000 |
| SMPREV_ESOCAN | -0.2655 | 0.0003 |
| SMPREV_STOCAN | -0.1196 | 0.1070 |
| SMPREV_LIVCAN | 0.1163 | 0.1171 |
| SMPREV_BRECAN | 0.0566 | 0.4465 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the OWPREV target variable
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','ACSHAR','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['OWPREV_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'OWPREV'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| OWPREV_OWPREV | 1.0000 | 0.0000 |
| OWPREV_PANCAN | 0.5360 | 0.0000 |
| OWPREV_CERCAN | -0.4677 | 0.0000 |
| OWPREV_ESOCAN | -0.4489 | 0.0000 |
| OWPREV_LUNCAN | 0.4445 | 0.0000 |
| OWPREV_COLCAN | 0.4442 | 0.0000 |
| OWPREV_STOCAN | -0.1189 | 0.1088 |
| OWPREV_BRECAN | 0.0490 | 0.5105 |
| OWPREV_PROCAN | 0.0280 | 0.7072 |
| OWPREV_LIVCAN | -0.0214 | 0.7737 |
##################################
# Computing the correlation coefficients
# and correlation p-values
# between the ACSHAR target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_correlation_target = {}
cancer_death_rate_preprocessed_numeric = cancer_death_rate_preprocessed.drop(['SMPREV','OWPREV','GEOLAT','GEOLON'], axis=1)
cancer_death_rate_preprocessed_numeric_columns = cancer_death_rate_preprocessed_numeric.columns.tolist()
for numeric_column in cancer_death_rate_preprocessed_numeric_columns:
cancer_death_rate_preprocessed_numeric_correlation_target['ACSHAR_' + numeric_column] = stats.pearsonr(
cancer_death_rate_preprocessed_numeric.loc[:, 'ACSHAR'],
cancer_death_rate_preprocessed_numeric.loc[:, numeric_column])
##################################
# Formulating the pairwise correlation summary
# between the target descriptor
# and numeric descriptor columns
##################################
cancer_death_rate_preprocessed_numeric_summary = pd.DataFrame.from_dict(cancer_death_rate_preprocessed_numeric_correlation_target, orient='index')
cancer_death_rate_preprocessed_numeric_summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
display(cancer_death_rate_preprocessed_numeric_summary.sort_values(by=['Correlation.PValue'], ascending=True).head(10))
| | Pearson.Correlation.Coefficient | Correlation.PValue |
|---|---|---|
| ACSHAR_ACSHAR | 1.0000 | 0.0000 |
| ACSHAR_COLCAN | 0.6039 | 0.0000 |
| ACSHAR_PANCAN | 0.5929 | 0.0000 |
| ACSHAR_LUNCAN | 0.4403 | 0.0000 |
| ACSHAR_PROCAN | 0.2083 | 0.0047 |
| ACSHAR_BRECAN | 0.1759 | 0.0172 |
| ACSHAR_CERCAN | -0.1347 | 0.0690 |
| ACSHAR_STOCAN | -0.1249 | 0.0921 |
| ACSHAR_ESOCAN | 0.0732 | 0.3248 |
| ACSHAR_LIVCAN | -0.0709 | 0.3401 |
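The three near-identical correlation loops above (for SMPREV, OWPREV, and ACSHAR) follow one pattern and can be folded into a single helper. A minimal sketch under the same column-naming convention; `correlation_summary` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd
from scipy import stats

def correlation_summary(df, target, exclude=()):
    """Pearson correlation of one target column against every numeric
    column of df (excluding any listed columns), sorted by p-value."""
    numeric = df.drop(columns=list(exclude))
    rows = {}
    for column in numeric.columns:
        r, p = stats.pearsonr(numeric[target], numeric[column])
        rows[f'{target}_{column}'] = (r, p)
    summary = pd.DataFrame.from_dict(rows, orient='index')
    summary.columns = ['Pearson.Correlation.Coefficient', 'Correlation.PValue']
    return summary.sort_values(by='Correlation.PValue')
```

With this helper, each of the three target summaries reduces to one call, e.g. `correlation_summary(cancer_death_rate_preprocessed, 'ACSHAR', exclude=['SMPREV', 'OWPREV', 'GEOLAT', 'GEOLON'])`.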
##################################
# Consolidating relevant numeric columns
# after hypothesis testing
##################################
cancer_death_rate_premodelling = cancer_death_rate_preprocessed.drop(['GEOLAT','GEOLON'], axis=1)
##################################
# Performing a general exploration of the premodelling dataset
##################################
print('Dataset Dimensions: ')
display(cancer_death_rate_premodelling.shape)
Dataset Dimensions:
(183, 12)
##################################
# Listing the column names and data types
##################################
print('Column Names and Data Types:')
display(cancer_death_rate_premodelling.dtypes)
Column Names and Data Types:
PROCAN    float64
BRECAN    float64
CERCAN    float64
STOCAN    float64
ESOCAN    float64
PANCAN    float64
LUNCAN    float64
COLCAN    float64
LIVCAN    float64
SMPREV    float64
OWPREV    float64
ACSHAR    float64
dtype: object
##################################
# Taking a snapshot of the dataset
##################################
cancer_death_rate_premodelling.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | -0.5405 | -1.4979 | -1.6782 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0.5329 | 0.6090 | 0.4008 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | -0.6438 | 0.9033 | -1.3345 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 1.1517 | 1.0213 | 1.1371 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | -1.0431 | -1.2574 | 0.3520 |
##################################
# Gathering the pairplot for all variables
##################################
sns.pairplot(cancer_death_rate_premodelling,
kind='reg',
plot_kws={'scatter_kws': {'alpha': 0.3},
'line_kws':{'color':'red'}})
plt.show()
##################################
# Preparing the clustering dataset
##################################
cancer_death_rate_premodelling_clustering = cancer_death_rate_premodelling.drop(['SMPREV','OWPREV','ACSHAR'], axis=1)
cancer_death_rate_premodelling_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 |
##################################
# Preparing the cross-validation data
# and parameters to be evaluated
# for the K-Means Clustering algorithm
##################################
X = cancer_death_rate_premodelling_clustering.copy()
kmeans_kfold_cluster_list = range(2, 10)
kmeans_kfold_cluster_silhouette_score = []
##################################
# Conducting the 5-fold cross-validation
# using the defined parameters
# for the K-Means Clustering algorithm
# for each individual cluster count
##################################
for k in kmeans_kfold_cluster_list:
    ##################################
    # Defining the hyperparameters
    ##################################
    km = KMeans(n_clusters=k,
                random_state=88888888,
                n_init='auto',
                init='k-means++')
    ##################################
    # Defining the 5-fold groups
    ##################################
    kfold = KFold(n_splits=5,
                  shuffle=True,
                  random_state=88888888)
    scores = []
    for train_index, test_index in kfold.split(X):
        ##################################
        # Formulating the train and test folds
        ##################################
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        ##################################
        # Fitting the K-Means Clustering algorithm
        # on the train data
        ##################################
        km.fit(X_train)
        ##################################
        # Assigning clusters for the test data
        ##################################
        labels = km.predict(X_test)
        ##################################
        # Computing the silhouette score
        # on the test data
        ##################################
        score = silhouette_score(X_test, labels)
        scores.append(score)
    ##################################
    # Calculating the average silhouette score
    # for the given cluster count
    ##################################
    average_score = np.mean(scores)
    kmeans_kfold_cluster_silhouette_score.append(average_score)
##################################
# Consolidating the model performance metrics
# for the K-Means Clustering algorithm
# using a range of K values
##################################
kmeans_clustering_kfold_summary = pd.DataFrame(zip(kmeans_kfold_cluster_list,
kmeans_kfold_cluster_silhouette_score),
columns=['KMeans.KFold.Cluster.Count',
'KMeans.KFold.Cluster.Average.Silhouette.Score'])
kmeans_clustering_kfold_summary
| | KMeans.KFold.Cluster.Count | KMeans.KFold.Cluster.Average.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2315 |
| 1 | 3 | 0.2268 |
| 2 | 4 | 0.1674 |
| 3 | 5 | 0.1419 |
| 4 | 6 | 0.1475 |
| 5 | 7 | 0.1410 |
| 6 | 8 | 0.0957 |
| 7 | 9 | 0.0862 |
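The same fold-and-score routine is repeated below for the Bisecting K-Means and GMM algorithms with only the estimator changing; it can be sketched once as a reusable function. This is a hypothetical refactoring, assuming a NumPy feature matrix and any estimator factory exposing `fit`/`predict`:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold

def cv_silhouette(estimator_factory, X, cluster_range, n_splits=5, seed=88888888):
    """Average held-out-fold silhouette score per cluster count, for any
    clustering estimator built by estimator_factory(k) with fit/predict."""
    averages = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for k in cluster_range:
        model = estimator_factory(k)
        scores = []
        for train_index, test_index in kfold.split(X):
            # fit on the training folds, score cluster assignments on the test fold
            model.fit(X[train_index])
            labels = model.predict(X[test_index])
            scores.append(silhouette_score(X[test_index], labels))
        averages.append(np.mean(scores))
    return averages
```

The notebook's K-Means sweep would then read `cv_silhouette(lambda k: KMeans(n_clusters=k, random_state=88888888, n_init='auto', init='k-means++'), X.to_numpy(), range(2, 10))`.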
###################################
# Plotting the Average Silhouette Score performance
# by cluster count using the 5-fold results
# for the K-Means Clustering algorithm
##################################
kmeans_kfold_cluster_count_values = np.array(kmeans_clustering_kfold_summary['KMeans.KFold.Cluster.Count'].values)
kmeans_kfold_silhouette_score_values = np.array(kmeans_clustering_kfold_summary['KMeans.KFold.Cluster.Average.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_kfold_cluster_count_values, kmeans_kfold_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("K-Means Clustering Algorithm: Cluster Count by Cross-Validated Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Average Silhouette Score")
plt.show()
##################################
# Fitting the K-Means Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
kmeans_cluster_list = list()
kmeans_cluster_inertia = list()
kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    km = KMeans(n_clusters=cluster_count,
                random_state=88888888,
                n_init='auto',
                init='k-means++')
    km = km.fit(cancer_death_rate_premodelling_clustering)
    kmeans_cluster_list.append(cluster_count)
    kmeans_cluster_inertia.append(km.inertia_)
    kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                            km.predict(cancer_death_rate_premodelling_clustering),
                                                            metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the K-Means Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
kmeans_clustering_evaluation_summary = pd.DataFrame(zip(kmeans_cluster_list,
kmeans_cluster_inertia,
kmeans_cluster_silhouette_score),
columns=['KMeans.Cluster.Count',
'KMeans.Cluster.Inertia',
'KMeans.Cluster.Silhouette.Score'])
kmeans_clustering_evaluation_summary
| | KMeans.Cluster.Count | KMeans.Cluster.Inertia | KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1027.3347 | 0.2330 |
| 2 | 4 | 948.1192 | 0.2323 |
| 3 | 5 | 897.3084 | 0.1608 |
| 4 | 6 | 821.6682 | 0.1576 |
| 5 | 7 | 771.4820 | 0.1627 |
| 6 | 8 | 725.5394 | 0.1633 |
| 7 | 9 | 670.6289 | 0.1836 |
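A simple numeric complement to eyeballing the inertia elbow plot is the discrete second difference of the inertia curve, which is largest where the curve bends most sharply. This is a sketch of that heuristic, not part of the original analysis:

```python
import numpy as np

def elbow_by_second_difference(cluster_counts, inertias):
    """Return the cluster count at the sharpest bend of the inertia
    curve, measured by the discrete second difference."""
    inertias = np.asarray(inertias, dtype=float)
    # second difference is only defined at interior points of the curve
    curvature = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    return cluster_counts[1 + int(np.argmax(curvature))]
```

Applied to the table above, e.g. `elbow_by_second_difference(kmeans_cluster_list, kmeans_cluster_inertia)`, the heuristic points at the bend between the steep and flat parts of the curve; it is only a tie-breaker and should be read alongside the silhouette results.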
###################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
# for the complete dataset
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_inertia_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the K-Means Clustering algorithm
# for the complete dataset
##################################
kmeans_cluster_count_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Count'].values)
kmeans_silhouette_score_values = np.array(kmeans_clustering_evaluation_summary['KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(kmeans_cluster_count_values, kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final K-Means Clustering model
# using the optimal cluster count
##################################
kmeans_clustering = KMeans(n_clusters=2,
random_state=88888888,
n_init='auto',
init='k-means++')
kmeans_clustering = kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Inertia and Silhouette Score
# for the final K-Means Clustering model
##################################
kmeans_clustering_inertia = kmeans_clustering.inertia_
kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
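The overall silhouette score above can mask an imbalanced solution where one cluster is cohesive and the other is not; per-sample silhouettes averaged within each cluster expose this. A hedged sketch using scikit-learn's `silhouette_samples` (`per_cluster_silhouette` is a hypothetical helper):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_samples

def per_cluster_silhouette(X, labels):
    """Mean silhouette width within each cluster (Euclidean distance)."""
    widths = silhouette_samples(X, labels)
    return pd.Series(widths).groupby(np.asarray(labels)).mean()
```

For the final model this would be called as `per_cluster_silhouette(cancer_death_rate_premodelling_clustering, kmeans_clustering.predict(cancer_death_rate_premodelling_clustering))`.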
##################################
# Assigning the cluster labels
# for the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_kmeans_clustering['KMEANS_CLUSTER'] = kmeans_clustering.predict(cancer_death_rate_kmeans_clustering)
cancer_death_rate_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_kmeans_clustering,
kind='kde',
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Preparing the cross-validation data
# and parameters to be evaluated
# for the K-Means Clustering algorithm
##################################
X = cancer_death_rate_premodelling_clustering.copy()
bisecting_kmeans_kfold_cluster_list = range(2, 10)
bisecting_kmeans_kfold_cluster_silhouette_score = []
##################################
# Conducting the 5-fold cross-validation
# using the defined parameters
# for the Bisecting K-Means Clustering algorithm
# for each individual cluster count
##################################
for k in bisecting_kmeans_kfold_cluster_list:
    ##################################
    # Defining the hyperparameters
    ##################################
    bk = BisectingKMeans(n_clusters=k,
                         random_state=88888888,
                         n_init=1,
                         init='k-means++')
    ##################################
    # Defining the 5-fold groups
    ##################################
    kfold = KFold(n_splits=5,
                  shuffle=True,
                  random_state=88888888)
    scores = []
    for train_index, test_index in kfold.split(X):
        ##################################
        # Formulating the train and test folds
        ##################################
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        ##################################
        # Fitting the Bisecting K-Means Clustering algorithm
        # on the train data
        ##################################
        bk.fit(X_train)
        ##################################
        # Assigning clusters for the test data
        ##################################
        labels = bk.predict(X_test)
        ##################################
        # Computing the silhouette score
        # on the test data
        ##################################
        score = silhouette_score(X_test, labels)
        scores.append(score)
    ##################################
    # Calculating the average silhouette score
    # for the given cluster count
    ##################################
    average_score = np.mean(scores)
    bisecting_kmeans_kfold_cluster_silhouette_score.append(average_score)
##################################
# Consolidating the model performance metrics
# for the Bisecting K-Means Clustering algorithm
# using a range of K values
##################################
bisecting_kmeans_clustering_kfold_summary = pd.DataFrame(zip(bisecting_kmeans_kfold_cluster_list,
bisecting_kmeans_kfold_cluster_silhouette_score),
columns=['Bisecting.KMeans.KFold.Cluster.Count',
'Bisecting.KMeans.KFold.Cluster.Average.Silhouette.Score'])
bisecting_kmeans_clustering_kfold_summary
| | Bisecting.KMeans.KFold.Cluster.Count | Bisecting.KMeans.KFold.Cluster.Average.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2315 |
| 1 | 3 | 0.1953 |
| 2 | 4 | 0.1669 |
| 3 | 5 | 0.1375 |
| 4 | 6 | 0.1250 |
| 5 | 7 | 0.1161 |
| 6 | 8 | 0.1100 |
| 7 | 9 | 0.1053 |
###################################
# Plotting the Average Silhouette Score performance
# by cluster count using the 5-fold results
# for the Bisecting K-Means Clustering algorithm
##################################
bisecting_kmeans_kfold_cluster_count_values = np.array(bisecting_kmeans_clustering_kfold_summary['Bisecting.KMeans.KFold.Cluster.Count'].values)
bisecting_kmeans_kfold_silhouette_score_values = np.array(bisecting_kmeans_clustering_kfold_summary['Bisecting.KMeans.KFold.Cluster.Average.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_kfold_cluster_count_values, bisecting_kmeans_kfold_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Cross-Validated Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Average Silhouette Score")
plt.show()
##################################
# Fitting the Bisecting K-Means Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
bisecting_kmeans_cluster_list = list()
bisecting_kmeans_cluster_inertia = list()
bisecting_kmeans_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    bk = BisectingKMeans(n_clusters=cluster_count,
                         random_state=88888888,
                         n_init=1,
                         init='k-means++')
    bk = bk.fit(cancer_death_rate_premodelling_clustering)
    bisecting_kmeans_cluster_list.append(cluster_count)
    bisecting_kmeans_cluster_inertia.append(bk.inertia_)
    bisecting_kmeans_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                      bk.predict(cancer_death_rate_premodelling_clustering),
                                                                      metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Bisecting K-Means Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
bisecting_kmeans_clustering_evaluation_summary = pd.DataFrame(zip(bisecting_kmeans_cluster_list,
bisecting_kmeans_cluster_inertia,
bisecting_kmeans_cluster_silhouette_score),
columns=['Bisecting.KMeans.Cluster.Count',
'Bisecting.KMeans.Cluster.Inertia',
'Bisecting.KMeans.Cluster.Silhouette.Score'])
bisecting_kmeans_clustering_evaluation_summary
| | Bisecting.KMeans.Cluster.Count | Bisecting.KMeans.Cluster.Inertia | Bisecting.KMeans.Cluster.Silhouette.Score |
|---|---|---|---|
| 0 | 2 | 1238.4894 | 0.2355 |
| 1 | 3 | 1080.6399 | 0.2146 |
| 2 | 4 | 955.1301 | 0.1887 |
| 3 | 5 | 891.9650 | 0.1762 |
| 4 | 6 | 843.0145 | 0.1750 |
| 5 | 7 | 798.7791 | 0.1341 |
| 6 | 8 | 758.0470 | 0.1413 |
| 7 | 9 | 714.1712 | 0.1503 |
###################################
# Plotting the Inertia performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
# for the complete dataset
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_inertia_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Inertia'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_inertia_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(500,1500)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Inertia")
plt.xlabel("Cluster")
plt.ylabel("Inertia")
plt.show()
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Bisecting K-Means Clustering algorithm
# for the complete dataset
##################################
bisecting_kmeans_cluster_count_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Count'].values)
bisecting_kmeans_silhouette_score_values = np.array(bisecting_kmeans_clustering_evaluation_summary['Bisecting.KMeans.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(bisecting_kmeans_cluster_count_values, bisecting_kmeans_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Bisecting K-Means Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Bisecting K-Means Clustering model
# using the optimal cluster count
##################################
bisecting_kmeans_clustering = BisectingKMeans(n_clusters=2,
random_state=88888888,
n_init=1,
init='k-means++')
bisecting_kmeans_clustering = bisecting_kmeans_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Inertia and Silhouette Score
# for the final Bisecting K-Means Clustering model
##################################
bisecting_kmeans_clustering_inertia = bisecting_kmeans_clustering.inertia_
bisecting_kmeans_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
bisecting_kmeans_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Assigning the cluster labels
# for the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_bisecting_kmeans_clustering['BISECTING_KMEANS_CLUSTER'] = bisecting_kmeans_clustering.predict(cancer_death_rate_bisecting_kmeans_clustering)
cancer_death_rate_bisecting_kmeans_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | BISECTING_KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
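Since both final models settle on two clusters, it is worth checking how closely the Bisecting K-Means partition matches the plain K-Means partition; the Adjusted Rand Index between the two label columns quantifies this. A hedged sketch (`labeling_agreement` is a hypothetical helper, not part of the original notebook):

```python
from sklearn.metrics import adjusted_rand_score

def labeling_agreement(labels_a, labels_b):
    """Adjusted Rand Index between two cluster labelings; 1.0 means
    identical partitions up to a permutation of the label names."""
    return adjusted_rand_score(labels_a, labels_b)
```

Here it would be applied as `labeling_agreement(cancer_death_rate_kmeans_clustering['KMEANS_CLUSTER'], cancer_death_rate_bisecting_kmeans_clustering['BISECTING_KMEANS_CLUSTER'])`; a value near 1.0 would indicate the two algorithms recovered essentially the same two-group structure.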
##################################
# Gathering the pairplot for all variables
# labelled using the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_bisecting_kmeans_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='BISECTING_KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_bisecting_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='BISECTING_KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final Bisecting K-Means Clustering model
##################################
cancer_death_rate_bisecting_kmeans_clustering_plot = sns.pairplot(cancer_death_rate_bisecting_kmeans_clustering,
kind='kde',
hue='BISECTING_KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_bisecting_kmeans_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='BISECTING_KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Preparing the cross-validation data
# and parameters to be evaluated
# for the GMM Clustering algorithm
##################################
X = cancer_death_rate_premodelling_clustering.copy()
gaussian_mixture_kfold_cluster_list = range(2, 10)
gaussian_mixture_kfold_cluster_silhouette_score = []
##################################
# Conducting the 5-fold cross-validation
# using the defined parameters
# for the GMM Clustering algorithm
# for each individual cluster count
##################################
for k in gaussian_mixture_kfold_cluster_list:
    ##################################
    # Defining the hyperparameters
    ##################################
    gm = GaussianMixture(n_components=k,
                         init_params='k-means++',
                         covariance_type='full',
                         tol=1e-3,
                         random_state=88888888)
    ##################################
    # Defining the 5-fold groups
    ##################################
    kfold = KFold(n_splits=5,
                  shuffle=True,
                  random_state=88888888)
    scores = []
    for train_index, test_index in kfold.split(X):
        ##################################
        # Formulating the train and test folds
        ##################################
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        ##################################
        # Fitting the GMM Clustering algorithm
        # on the train data
        ##################################
        gm.fit(X_train)
        ##################################
        # Assigning clusters for the test data
        ##################################
        labels = gm.predict(X_test)
        ##################################
        # Computing the silhouette score
        # on the test data
        ##################################
        score = silhouette_score(X_test, labels)
        scores.append(score)
    ##################################
    # Calculating the average silhouette score
    # for the given cluster count
    ##################################
    average_score = np.mean(scores)
    gaussian_mixture_kfold_cluster_silhouette_score.append(average_score)
##################################
# Consolidating the model performance metrics
# for the GMM Clustering algorithm
# using a range of K values
##################################
gaussian_mixture_clustering_kfold_summary = pd.DataFrame(zip(gaussian_mixture_kfold_cluster_list,
gaussian_mixture_kfold_cluster_silhouette_score),
columns=['GMM.KFold.Cluster.Count',
'GMM.KFold.Cluster.Average.Silhouette.Score'])
gaussian_mixture_clustering_kfold_summary
| | GMM.KFold.Cluster.Count | GMM.KFold.Cluster.Average.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.1289 |
| 1 | 3 | 0.0730 |
| 2 | 4 | 0.0814 |
| 3 | 5 | 0.0451 |
| 4 | 6 | 0.0644 |
| 5 | 7 | 0.0631 |
| 6 | 8 | 0.0353 |
| 7 | 9 | 0.0144 |
###################################
# Plotting the Average Silhouette Score performance
# by cluster count using the 5-fold results
# for the GMM Clustering algorithm
##################################
gaussian_mixture_kfold_cluster_count_values = np.array(gaussian_mixture_clustering_kfold_summary['GMM.KFold.Cluster.Count'].values)
gaussian_mixture_kfold_silhouette_score_values = np.array(gaussian_mixture_clustering_kfold_summary['GMM.KFold.Cluster.Average.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(gaussian_mixture_kfold_cluster_count_values, gaussian_mixture_kfold_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("GMM Clustering Algorithm: Cluster Count by Cross-Validated Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Average Silhouette Score")
plt.show()
##################################
# Fitting the GMM Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
gaussian_mixture_cluster_list = list()
gaussian_mixture_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    gm = GaussianMixture(n_components=cluster_count,
                         init_params='k-means++',
                         covariance_type='full',
                         tol=1e-3,
                         random_state=88888888)
    gm = gm.fit(cancer_death_rate_premodelling_clustering)
    gaussian_mixture_cluster_list.append(cluster_count)
    gaussian_mixture_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                      gm.predict(cancer_death_rate_premodelling_clustering),
                                                                      metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the GMM Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
gaussian_mixture_clustering_evaluation_summary = pd.DataFrame(zip(gaussian_mixture_cluster_list,
gaussian_mixture_cluster_silhouette_score),
columns=['GMM.Cluster.Count',
'GMM.Cluster.Silhouette.Score'])
gaussian_mixture_clustering_evaluation_summary
| | GMM.Cluster.Count | GMM.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2239 |
| 1 | 3 | 0.2235 |
| 2 | 4 | 0.2026 |
| 3 | 5 | 0.1205 |
| 4 | 6 | 0.1208 |
| 5 | 7 | 0.1266 |
| 6 | 8 | 0.1320 |
| 7 | 9 | 0.1348 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the GMM Clustering algorithm
# for the complete dataset
##################################
gaussian_mixture_cluster_count_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Count'].values)
gaussian_mixture_silhouette_score_values = np.array(gaussian_mixture_clustering_evaluation_summary['GMM.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(gaussian_mixture_cluster_count_values, gaussian_mixture_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("GMM Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final GMM Clustering model
# using the optimal cluster count
##################################
gaussian_mixture_clustering = GaussianMixture(n_components=2,
                                              init_params='k-means++',
                                              covariance_type='full',
                                              tol=1e-3,
                                              random_state=88888888)
gaussian_mixture_clustering = gaussian_mixture_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final GMM Clustering model
##################################
gaussian_mixture_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering,
gaussian_mixture_clustering.predict(cancer_death_rate_premodelling_clustering),
metric='euclidean')
##################################
# Plotting the cluster labels
# for the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_gaussian_mixture_clustering['GMM_CLUSTER'] = gaussian_mixture_clustering.predict(cancer_death_rate_gaussian_mixture_clustering)
cancer_death_rate_gaussian_mixture_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | GMM_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering_plot = sns.pairplot(cancer_death_rate_gaussian_mixture_clustering,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='GMM_CLUSTER');
sns.move_legend(cancer_death_rate_gaussian_mixture_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='GMM_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final GMM Clustering model
##################################
cancer_death_rate_gaussian_mixture_clustering_plot = sns.pairplot(cancer_death_rate_gaussian_mixture_clustering,
kind='kde',
hue='GMM_CLUSTER');
sns.move_legend(cancer_death_rate_gaussian_mixture_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='GMM_CLUSTER', frameon=False)
plt.show()
##################################
# Preparing the cross-validation data
# and parameters to be evaluated
# for the Agglomerative Clustering algorithm
##################################
X = cancer_death_rate_premodelling_clustering.copy()
agglomerative_kfold_cluster_list = range(2, 10)
agglomerative_kfold_cluster_silhouette_score = []
##################################
# Conducting the 5-fold cross-validation
# using the defined parameters
# for the Agglomerative Clustering algorithm
# for each individual cluster count
##################################
for k in agglomerative_kfold_cluster_list:
    ##################################
    # Defining the hyperparameters
    ##################################
    ag = AgglomerativeClustering(n_clusters=k,
                                 linkage='complete')
    ##################################
    # Defining the 5-fold groups
    ##################################
    kfold = KFold(n_splits=5,
                  shuffle=True,
                  random_state=88888888)
    scores = []
    for train_index, test_index in kfold.split(X):
        ##################################
        # Formulating the 5-fold groups
        ##################################
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        ##################################
        # Assigning clusters for the test data
        # (AgglomerativeClustering has no predict method,
        # so the algorithm is refitted on each test fold)
        ##################################
        labels = ag.fit_predict(X_test)
        ##################################
        # Computing the silhouette score
        # on the test data
        ##################################
        score = silhouette_score(X_test, labels)
        scores.append(score)
    ##################################
    # Calculating the average silhouette score
    # for the given cluster count
    ##################################
    average_score = np.mean(scores)
    agglomerative_kfold_cluster_silhouette_score.append(average_score)
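Because AgglomerativeClustering exposes no `predict` method, the cross-validation above ends up refitting the algorithm on each test fold, so the test labels do not come from a model trained on the training fold. One alternative sketch (a hypothetical helper, not part of the study's code; expects NumPy arrays, so pass `.to_numpy()` for DataFrames) transfers training-fold assignments to the test fold via nearest training-cluster centroids:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def agglomerative_transfer_labels(X_train, X_test, n_clusters, linkage='complete'):
    """Fit on the training fold, then label the test fold by nearest
    training-cluster centroid, since AgglomerativeClustering cannot
    predict on unseen data directly."""
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage)
    train_labels = model.fit_predict(X_train)
    # Mean of each training cluster serves as its centroid
    centroids = np.vstack([X_train[train_labels == c].mean(axis=0)
                           for c in range(n_clusters)])
    # Assign each test point to its nearest centroid (Euclidean)
    distances = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Demo on two well-separated toy blobs
rng = np.random.default_rng(88888888)
X_tr = np.vstack([rng.normal(-3.0, 0.3, (20, 2)),
                  rng.normal(3.0, 0.3, (20, 2))])
X_te = np.array([[-3.0, -3.0], [3.0, 3.0]])
test_labels = agglomerative_transfer_labels(X_tr, X_te, n_clusters=2)
```

Either convention is defensible for an internal-validation study, but the choice affects what the averaged silhouette score measures and is worth stating explicitly.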
##################################
# Consolidating the model performance metrics
# for the Agglomerative Clustering algorithm
# using a range of K values
##################################
agglomerative_clustering_kfold_summary = pd.DataFrame(zip(agglomerative_kfold_cluster_list,
agglomerative_kfold_cluster_silhouette_score),
columns=['Agglomerative.KFold.Cluster.Count',
'Agglomerative.KFold.Cluster.Average.Silhouette.Score'])
agglomerative_clustering_kfold_summary
| | Agglomerative.KFold.Cluster.Count | Agglomerative.KFold.Cluster.Average.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2232 |
| 1 | 3 | 0.2336 |
| 2 | 4 | 0.2285 |
| 3 | 5 | 0.2339 |
| 4 | 6 | 0.2256 |
| 5 | 7 | 0.2083 |
| 6 | 8 | 0.2231 |
| 7 | 9 | 0.2217 |
###################################
# Plotting the Average Silhouette Score performance
# by cluster count using the 5-fold results
# for the Agglomerative Clustering algorithm
##################################
agglomerative_kfold_cluster_count_values = np.array(agglomerative_clustering_kfold_summary['Agglomerative.KFold.Cluster.Count'].values)
agglomerative_kfold_silhouette_score_values = np.array(agglomerative_clustering_kfold_summary['Agglomerative.KFold.Cluster.Average.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(agglomerative_kfold_cluster_count_values, agglomerative_kfold_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Agglomerative Clustering Algorithm: Cluster Count by Cross-Validated Silhouette Score")
plt.xlabel("Cluster Count")
plt.ylabel("Average Silhouette Score")
plt.show()
##################################
# Fitting the Agglomerative Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
agglomerative_cluster_list = list()
agglomerative_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    ag = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='complete')
    ag = ag.fit(cancer_death_rate_premodelling_clustering)
    agglomerative_cluster_list.append(cluster_count)
    agglomerative_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                   ag.labels_,
                                                                   metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Agglomerative Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
agglomerative_clustering_evaluation_summary = pd.DataFrame(zip(agglomerative_cluster_list,
agglomerative_cluster_silhouette_score),
columns=['Agglomerative.Cluster.Count',
'Agglomerative.Cluster.Silhouette.Score'])
agglomerative_clustering_evaluation_summary
| | Agglomerative.Cluster.Count | Agglomerative.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.1629 |
| 1 | 3 | 0.1311 |
| 2 | 4 | 0.1127 |
| 3 | 5 | 0.1617 |
| 4 | 6 | 0.2035 |
| 5 | 7 | 0.1995 |
| 6 | 8 | 0.2006 |
| 7 | 9 | 0.1968 |
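Hierarchical clustering also admits a view K-Means does not: the full merge history, which can be inspected as a dendrogram before committing to a cluster count. A sketch of that inspection (assuming SciPy is available; toy data stands in for the standardized death-rate matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy standardized matrix: 40 observations, 9 features
rng = np.random.default_rng(88888888)
X = rng.normal(size=(40, 9))

# Complete-linkage merge history; each row records one merge:
# (cluster_i, cluster_j, merge_distance, merged_cluster_size)
Z = linkage(X, method='complete', metric='euclidean')

# Cutting the tree into 5 flat clusters mirrors n_clusters=5 above
labels = fcluster(Z, t=5, criterion='maxclust')
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` renders the tree, so large jumps in merge distance can be read off as natural cut points.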
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Agglomerative Clustering algorithm
# for the complete dataset
##################################
agglomerative_cluster_count_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Count'].values)
agglomerative_silhouette_score_values = np.array(agglomerative_clustering_evaluation_summary['Agglomerative.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(agglomerative_cluster_count_values, agglomerative_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Agglomerative Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster Count")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Agglomerative Clustering model
# using the optimal cluster count
##################################
agglomerative_clustering = AgglomerativeClustering(n_clusters=5,
linkage='complete')
agglomerative_clustering = agglomerative_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final Agglomerative Clustering model
##################################
agglomerative_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, agglomerative_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_agglomerative_clustering['AGGLOMERATIVE_CLUSTER'] = agglomerative_clustering.labels_
cancer_death_rate_agglomerative_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | AGGLOMERATIVE_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 4 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 2 |
##################################
# Gathering the pairplot for all variables
# labelled using the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering_plot = sns.pairplot(cancer_death_rate_agglomerative_clustering,
kind='reg',
markers=['o', 's','X','D','P'],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='AGGLOMERATIVE_CLUSTER');
sns.move_legend(cancer_death_rate_agglomerative_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=5, title='AGGLOMERATIVE_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final Agglomerative Clustering model
##################################
cancer_death_rate_agglomerative_clustering_plot = sns.pairplot(cancer_death_rate_agglomerative_clustering,
kind='kde',
hue='AGGLOMERATIVE_CLUSTER');
sns.move_legend(cancer_death_rate_agglomerative_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=5, title='AGGLOMERATIVE_CLUSTER', frameon=False)
plt.show()
##################################
# Preparing the cross-validation data
# and parameters to be evaluated
# for the Ward Hierarchical Clustering algorithm
##################################
X = cancer_death_rate_premodelling_clustering.copy()
ward_hierarchical_kfold_cluster_list = range(2, 10)
ward_hierarchical_kfold_cluster_silhouette_score = []
##################################
# Conducting the 5-fold cross-validation
# using the defined parameters
# for the Ward Hierarchical Clustering algorithm
# for each individual cluster count
##################################
for k in ward_hierarchical_kfold_cluster_list:
    ##################################
    # Defining the hyperparameters
    ##################################
    wh = AgglomerativeClustering(n_clusters=k,
                                 linkage='ward')
    ##################################
    # Defining the 5-fold groups
    ##################################
    kfold = KFold(n_splits=5,
                  shuffle=True,
                  random_state=88888888)
    scores = []
    for train_index, test_index in kfold.split(X):
        ##################################
        # Formulating the 5-fold groups
        ##################################
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        ##################################
        # Assigning clusters for the test data
        # (AgglomerativeClustering has no predict method,
        # so the algorithm is refitted on each test fold)
        ##################################
        labels = wh.fit_predict(X_test)
        ##################################
        # Computing the silhouette score
        # on the test data
        ##################################
        score = silhouette_score(X_test, labels)
        scores.append(score)
    ##################################
    # Calculating the average silhouette score
    # for the given cluster count
    ##################################
    average_score = np.mean(scores)
    ward_hierarchical_kfold_cluster_silhouette_score.append(average_score)
##################################
# Consolidating the model performance metrics
# for the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_clustering_kfold_summary = pd.DataFrame(zip(ward_hierarchical_kfold_cluster_list,
ward_hierarchical_kfold_cluster_silhouette_score),
columns=['Ward.Hierarchical.KFold.Cluster.Count',
'Ward.Hierarchical.KFold.Cluster.Average.Silhouette.Score'])
ward_hierarchical_clustering_kfold_summary
| | Ward.Hierarchical.KFold.Cluster.Count | Ward.Hierarchical.KFold.Cluster.Average.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2173 |
| 1 | 3 | 0.2309 |
| 2 | 4 | 0.2312 |
| 3 | 5 | 0.2202 |
| 4 | 6 | 0.2269 |
| 5 | 7 | 0.2330 |
| 6 | 8 | 0.2411 |
| 7 | 9 | 0.2492 |
###################################
# Plotting the Average Silhouette Score performance
# by cluster count using the 5-fold results
# for the Ward Hierarchical Clustering algorithm
##################################
ward_hierarchical_kfold_cluster_count_values = np.array(ward_hierarchical_clustering_kfold_summary['Ward.Hierarchical.KFold.Cluster.Count'].values)
ward_hierarchical_kfold_silhouette_score_values = np.array(ward_hierarchical_clustering_kfold_summary['Ward.Hierarchical.KFold.Cluster.Average.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(ward_hierarchical_kfold_cluster_count_values, ward_hierarchical_kfold_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Ward Hierarchical Clustering Algorithm: Cluster Count by Cross-Validated Silhouette Score")
plt.xlabel("Cluster Count")
plt.ylabel("Average Silhouette Score")
plt.show()
##################################
# Fitting the Ward Hierarchical Clustering algorithm
# using a range of K values
# for the complete dataset
##################################
ward_hierarchical_cluster_list = list()
ward_hierarchical_cluster_silhouette_score = list()
for cluster_count in range(2,10):
    wh = AgglomerativeClustering(n_clusters=cluster_count,
                                 linkage='ward')
    wh = wh.fit(cancer_death_rate_premodelling_clustering)
    ward_hierarchical_cluster_list.append(cluster_count)
    ward_hierarchical_cluster_silhouette_score.append(silhouette_score(cancer_death_rate_premodelling_clustering,
                                                                       wh.labels_,
                                                                       metric='euclidean'))
##################################
# Consolidating the model performance metrics
# for the Ward Hierarchical Clustering algorithm
# using a range of K values
##################################
ward_hierarchical_clustering_evaluation_summary = pd.DataFrame(zip(ward_hierarchical_cluster_list,
ward_hierarchical_cluster_silhouette_score),
columns=['Ward.Hierarchical.Cluster.Count',
'Ward.Hierarchical.Cluster.Silhouette.Score'])
ward_hierarchical_clustering_evaluation_summary
| | Ward.Hierarchical.Cluster.Count | Ward.Hierarchical.Cluster.Silhouette.Score |
|---|---|---|
| 0 | 2 | 0.2148 |
| 1 | 3 | 0.1924 |
| 2 | 4 | 0.1840 |
| 3 | 5 | 0.1714 |
| 4 | 6 | 0.1858 |
| 5 | 7 | 0.1803 |
| 6 | 8 | 0.1595 |
| 7 | 9 | 0.1689 |
###################################
# Plotting the Silhouette Score performance
# by cluster count using a range of K values
# for the Ward Hierarchical Clustering algorithm
##################################
ward_hierarchical_cluster_count_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Count'].values)
ward_hierarchical_silhouette_score_values = np.array(ward_hierarchical_clustering_evaluation_summary['Ward.Hierarchical.Cluster.Silhouette.Score'].values)
plt.figure(figsize=(10, 6))
plt.plot(ward_hierarchical_cluster_count_values, ward_hierarchical_silhouette_score_values, marker='o',ls='-')
plt.grid(True)
plt.ylim(0,0.3)
plt.title("Ward Hierarchical Clustering Algorithm: Cluster Count by Silhouette Score")
plt.xlabel("Cluster Count")
plt.ylabel("Silhouette Score")
plt.show()
###################################
# Formulating the final Ward Hierarchical Clustering model
# using the optimal cluster count
##################################
ward_hierarchical_clustering = AgglomerativeClustering(n_clusters=9,
linkage='ward')
ward_hierarchical_clustering = ward_hierarchical_clustering.fit(cancer_death_rate_premodelling_clustering)
###################################
# Gathering the Silhouette Score
# for the final Ward Hierarchical model
##################################
ward_hierarchical_clustering_silhouette_score = silhouette_score(cancer_death_rate_premodelling_clustering, ward_hierarchical_clustering.labels_, metric='euclidean')
##################################
# Plotting the cluster labels
# for the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering = cancer_death_rate_premodelling_clustering.copy()
cancer_death_rate_ward_hierarchical_clustering['WARD_HIERARCHICAL_CLUSTER'] = ward_hierarchical_clustering.labels_
cancer_death_rate_ward_hierarchical_clustering.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | WARD_HIERARCHICAL_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 4 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 7 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering_plot = sns.pairplot(cancer_death_rate_ward_hierarchical_clustering,
kind='reg',
markers=['o', 's','X','D','P','*','v','^','h'],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='WARD_HIERARCHICAL_CLUSTER');
sns.move_legend(cancer_death_rate_ward_hierarchical_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=9, title='WARD_HIERARCHICAL_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final Ward Hierarchical Clustering model
##################################
cancer_death_rate_ward_hierarchical_clustering_plot = sns.pairplot(cancer_death_rate_ward_hierarchical_clustering,
kind='kde',
hue='WARD_HIERARCHICAL_CLUSTER');
sns.move_legend(cancer_death_rate_ward_hierarchical_clustering_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=9, title='WARD_HIERARCHICAL_CLUSTER', frameon=False)
plt.show()
##################################
# Consolidating all the
# model performance measures
##################################
clustering_silhouette_score_list = [kmeans_clustering_silhouette_score,
bisecting_kmeans_clustering_silhouette_score,
gaussian_mixture_clustering_silhouette_score,
agglomerative_clustering_silhouette_score,
ward_hierarchical_clustering_silhouette_score]
clustering_silhouette_algorithm_list = ['kmeans_clustering',
'bisecting_kmeans_clustering',
'gaussian_mixture_clustering',
'agglomerative_clustering',
'ward_hierarchical_clustering']
performance_comparison_silhouette_score = pd.DataFrame(zip(clustering_silhouette_algorithm_list,
clustering_silhouette_score_list),
columns=['Clustering.Algorithm',
'Silhouette.Score'])
print('Consolidated Model Performance: ')
display(performance_comparison_silhouette_score)
Consolidated Model Performance:
| | Clustering.Algorithm | Silhouette.Score |
|---|---|---|
| 0 | kmeans_clustering | 0.2355 |
| 1 | bisecting_kmeans_clustering | 0.2355 |
| 2 | gaussian_mixture_clustering | 0.2239 |
| 3 | agglomerative_clustering | 0.1617 |
| 4 | ward_hierarchical_clustering | 0.1689 |
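Silhouette is only one internal criterion, and rankings can shift under other indices. A hedged sketch (on toy data, not the study's matrix) of computing two complementary scikit-learn indices alongside it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

# Toy standardized matrix with two loose groups
rng = np.random.default_rng(88888888)
X = np.vstack([rng.normal(-1.0, 0.5, (30, 4)),
               rng.normal(1.0, 0.5, (30, 4))])

labels = KMeans(n_clusters=2, n_init=10, random_state=88888888).fit_predict(X)
sil = silhouette_score(X, labels)        # in [-1, 1]; higher is better
ch = calinski_harabasz_score(X, labels)  # higher is better
db = davies_bouldin_score(X, labels)     # lower is better
```

Agreement across all three indices strengthens the case for a chosen model more than any single score does.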
##################################
# Plotting all the Silhouette Score
# model performance measures
##################################
performance_comparison_silhouette_score.set_index('Clustering.Algorithm', inplace=True)
performance_comparison_silhouette_score_plot = performance_comparison_silhouette_score.plot.barh(figsize=(10, 6))
performance_comparison_silhouette_score_plot.set_xlim(0.00,1.00)
performance_comparison_silhouette_score_plot.set_title("Model Comparison by Silhouette Score Performance at Each Model's Optimal Cluster Count")
performance_comparison_silhouette_score_plot.set_xlabel("Silhouette Score Performance")
performance_comparison_silhouette_score_plot.set_ylabel("Clustering Model")
performance_comparison_silhouette_score_plot.grid(False)
performance_comparison_silhouette_score_plot.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
for container in performance_comparison_silhouette_score_plot.containers:
    performance_comparison_silhouette_score_plot.bar_label(container, fmt='%.5f', padding=-50, color='white', fontweight='bold')
##################################
# Exploring the selected final model
# using the clustering descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_descriptor = cancer_death_rate_kmeans_clustering.copy()
cancer_death_rate_kmeans_clustering_descriptor.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | 1 |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | 0 |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | 0 |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | 0 |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | 1 |
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_descriptor_plot = sns.pairplot(cancer_death_rate_kmeans_clustering_descriptor,
kind='reg',
markers=["o", "s"],
plot_kws={'scatter_kws': {'alpha': 0.3}},
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_descriptor_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Gathering the pairplot for all variables
# labelled using the final K-Means Clustering model
##################################
cancer_death_rate_kmeans_clustering_descriptor_plot = sns.pairplot(cancer_death_rate_kmeans_clustering_descriptor,
kind='kde',
hue='KMEANS_CLUSTER');
sns.move_legend(cancer_death_rate_kmeans_clustering_descriptor_plot,
"lower center",
bbox_to_anchor=(.5, 1), ncol=2, title='KMEANS_CLUSTER', frameon=False)
plt.show()
##################################
# Computing the average descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_descriptor['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_descriptor_clustered = cancer_death_rate_kmeans_clustering_descriptor.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_descriptor_clustered)
| KMEANS_CLUSTER | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN |
|---|---|---|---|---|---|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | -0.4004 | -0.0894 | -0.7876 | -0.4930 | -0.4541 | 0.6040 | 0.7054 | 0.6445 | 0.0465 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | 0.3550 | 0.0793 | 0.6983 | 0.4371 | 0.4026 | -0.5355 | -0.6254 | -0.5714 | -0.0413 |
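The cluster means above are in standardized units. A sketch of mapping such averages back to deaths per 100,000 (using hypothetical toy numbers, and assuming the StandardScaler fitted during preprocessing is still in scope; standardization is affine, so inverse-transforming the z-score means recovers the raw means exactly):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy raw rates standing in for the unscaled death-rate matrix
raw = pd.DataFrame({'PANCAN': [4.0, 5.0, 9.0, 10.0],
                    'LUNCAN': [20.0, 22.0, 40.0, 44.0]})
scaler = StandardScaler().fit(raw)
scaled = pd.DataFrame(scaler.transform(raw), columns=raw.columns)
scaled['CLUSTER'] = [0, 0, 1, 1]

# Average in standardized space, then map back to the original units
z_means = scaled.groupby('CLUSTER').mean()
raw_means = pd.DataFrame(scaler.inverse_transform(z_means),
                         index=z_means.index, columns=z_means.columns)
# raw_means: cluster 0 → PANCAN 4.5, LUNCAN 21.0; cluster 1 → PANCAN 9.5, LUNCAN 42.0
```

Reporting both scales keeps the heatmap comparable across cancer types while making the absolute burden interpretable.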
##################################
# Computing the average of the
# clustering descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_descriptor_clustered, annot=True, cmap="seismic")
plt.xlabel('Cancer Types')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Death Rates by Cancer Type and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the target descriptors
# and K-Means clusters
##################################
cancer_death_rate_kmeans_clustering_target = pd.concat([cancer_death_rate_kmeans_clustering[['KMEANS_CLUSTER']],cancer_death_rate_preprocessed[['SMPREV','OWPREV','ACSHAR']]], axis=1, join='inner')
cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER'] = np.where(cancer_death_rate_kmeans_clustering_target['KMEANS_CLUSTER']== 0,'HIGH_PAN_LUN_COL_LIV_CAN','HIGH_PRO_BRE_CER_STO_ESO_CAN')
cancer_death_rate_kmeans_clustering_target.head()
| | KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5405 | -1.4979 | -1.6782 |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | 0.5329 | 0.6090 | 0.4008 |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | -0.6438 | 0.9033 | -1.3345 |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | 1.1517 | 1.0213 | 1.1371 |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -1.0431 | -1.2574 | 0.3520 |
##################################
# Computing the target descriptors
# for each K-Means Cluster
##################################
cancer_death_rate_kmeans_target_clustered = cancer_death_rate_kmeans_clustering_target.groupby('KMEANS_CLUSTER').mean()
display(cancer_death_rate_kmeans_target_clustered)
| KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR |
|---|---|---|---|
| HIGH_PAN_LUN_COL_LIV_CAN | 0.6433 | 0.4329 | 0.3218 |
| HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5704 | -0.3838 | -0.2853 |
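The cluster-level averages above suggest the lifestyle factors separate the two groups; a sketch of how such a difference could be checked formally (with hypothetical toy values, not the study's data) uses a rank-based Mann-Whitney U test, which avoids normality assumptions:

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical standardized smoking-prevalence values per cluster
rng = np.random.default_rng(88888888)
frame = pd.DataFrame({
    'KMEANS_CLUSTER': (['HIGH_PAN_LUN_COL_LIV_CAN'] * 40
                       + ['HIGH_PRO_BRE_CER_STO_ESO_CAN'] * 40),
    'SMPREV': np.concatenate([rng.normal(0.6, 0.8, 40),
                              rng.normal(-0.6, 0.8, 40)])
})

group_a = frame.loc[frame['KMEANS_CLUSTER'] == 'HIGH_PAN_LUN_COL_LIV_CAN', 'SMPREV']
group_b = frame.loc[frame['KMEANS_CLUSTER'] == 'HIGH_PRO_BRE_CER_STO_ESO_CAN', 'SMPREV']
statistic, p_value = mannwhitneyu(group_a, group_b)
```

Because the clusters were derived from the same countries being compared, such p-values are descriptive rather than confirmatory, but they help rank which lifestyle factors distinguish the groups most strongly.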
##################################
# Computing the average of the
# target descriptors
# for each K-Means Cluster
##################################
plt.figure(figsize=(10, 8))
sns.heatmap(cancer_death_rate_kmeans_target_clustered, annot=True, cmap="seismic")
plt.xlabel('Lifestyle Factors')
plt.ylabel('K-Means Clusters')
plt.title('Heatmap of Lifestyle Factors and K-Means Clusters')
plt.show()
##################################
# Exploring the selected final model
# using the location data
# and K-Means clusters
##################################
cancer_death_rate_kmeans_cluster_map = pd.concat([cancer_death_rate_kmeans_clustering_target[['KMEANS_CLUSTER']],cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_cluster_map.head()
| | KMEANS_CLUSTER | CODE |
|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Loading the globalmap shape file
##################################
world = gpd.read_file('custom.geo.json')
##################################
# Merging the GeoDataFrame
# with world map using country codes
##################################
world_cluster = world.merge(cancer_death_rate_kmeans_cluster_map, left_on='gu_a3', right_on='CODE', how='left')
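A left join silently renders any country whose code fails to match as missing on the map. A small validation sketch (with hypothetical miniature frames, not the actual GeoDataFrame) using pandas' `indicator` flag to surface unmatched codes before plotting:

```python
import pandas as pd

# Hypothetical frames mimicking the country-code merge above
world_codes = pd.DataFrame({'gu_a3': ['AFG', 'ALB', 'DZA', 'XYZ']})
clusters = pd.DataFrame({'CODE': ['AFG', 'ALB', 'DZA'],
                         'KMEANS_CLUSTER': ['HIGH_PRO_BRE_CER_STO_ESO_CAN',
                                            'HIGH_PAN_LUN_COL_LIV_CAN',
                                            'HIGH_PAN_LUN_COL_LIV_CAN']})

# indicator=True adds a '_merge' column flagging the join outcome per row
merged = world_codes.merge(clusters, left_on='gu_a3', right_on='CODE',
                           how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'gu_a3'].tolist()
# unmatched → ['XYZ']
```

Logging `unmatched` makes it explicit whether blank regions on the choropleth reflect excluded observations or code mismatches between the shapefile and the dataset.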
##################################
# Plotting the map by K-Means cluster
##################################
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
world_cluster.boundary.plot(ax=ax, linewidth=1)
world_cluster.plot(column='KMEANS_CLUSTER', cmap="seismic", legend=True, ax=ax, legend_kwds={"loc": "center left", "bbox_to_anchor": (1, 0.5)})
plt.title('KMEANS_CLUSTER')
plt.show()
##################################
# Plotting the map by K-Means descriptors
##################################
cancer_death_rate_kmeans_descriptor_map = pd.concat([cancer_death_rate_kmeans_clustering_descriptor,cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_descriptor_map.head()
| | PROCAN | BRECAN | CERCAN | STOCAN | ESOCAN | PANCAN | LUNCAN | COLCAN | LIVCAN | KMEANS_CLUSTER | CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.6922 | -0.4550 | -0.1771 | 2.0964 | 0.9425 | -1.4794 | -0.6095 | -0.9258 | 1.4059 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AFG |
| 1 | -0.0867 | -1.3608 | -1.1020 | 0.3084 | -1.4329 | 0.2506 | 0.8754 | -0.7177 | 0.8924 | HIGH_PAN_LUN_COL_LIV_CAN | ALB |
| 2 | -1.0261 | -0.8704 | -0.8184 | -1.2331 | -1.8001 | -0.6858 | -0.9625 | -1.0428 | -1.1914 | HIGH_PAN_LUN_COL_LIV_CAN | DZA |
| 3 | 0.0691 | -0.3352 | -0.8866 | -0.1990 | 0.0272 | 1.2933 | 1.3658 | 1.5903 | 1.3091 | HIGH_PAN_LUN_COL_LIV_CAN | AND |
| 4 | 0.5801 | 0.3703 | 1.0686 | -0.0427 | 1.2150 | -1.1052 | -0.2718 | -0.5826 | -0.8379 | HIGH_PRO_BRE_CER_STO_ESO_CAN | AGO |
##################################
# Merging the GeoDataFrame
# with world map using country codes
##################################
world_descriptor = world.merge(cancer_death_rate_kmeans_descriptor_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by Pancreatic Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PANCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PANCAN"})
plt.title('PANCAN')
plt.show()
##################################
# Plotting the map by Lung Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LUNCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LUNCAN"})
plt.title('LUNCAN')
plt.show()
##################################
# Plotting the map by Colon Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='COLCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "COLCAN"})
plt.title('COLCAN')
plt.show()
##################################
# Plotting the map by Liver Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='LIVCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "LIVCAN"})
plt.title('LIVCAN')
plt.show()
##################################
# Plotting the map by Prostate Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='PROCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "PROCAN"})
plt.title('PROCAN')
plt.show()
##################################
# Plotting the map by Breast Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='BRECAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "BRECAN"})
plt.title('BRECAN')
plt.show()
##################################
# Plotting the map by Cervical Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='CERCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "CERCAN"})
plt.title('CERCAN')
plt.show()
##################################
# Plotting the map by Stomach Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='STOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "STOCAN"})
plt.title('STOCAN')
plt.show()
##################################
# Plotting the map by Esophageal Cancer Death Rate
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7.5))
world_descriptor.boundary.plot(ax=ax, linewidth=1)
world_descriptor.plot(column='ESOCAN', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "ESOCAN"})
plt.title('ESOCAN')
plt.show()
##################################
# Plotting the map by K-Means target
##################################
cancer_death_rate_kmeans_target_map = pd.concat([cancer_death_rate_kmeans_clustering_target,cancer_death_rate_filtered_row[['CODE']]], axis=1, join='inner')
cancer_death_rate_kmeans_target_map.head()
| | KMEANS_CLUSTER | SMPREV | OWPREV | ACSHAR | CODE |
|---|---|---|---|---|---|
| 0 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -0.5405 | -1.4979 | -1.6782 | AFG |
| 1 | HIGH_PAN_LUN_COL_LIV_CAN | 0.5329 | 0.6090 | 0.4008 | ALB |
| 2 | HIGH_PAN_LUN_COL_LIV_CAN | -0.6438 | 0.9033 | -1.3345 | DZA |
| 3 | HIGH_PAN_LUN_COL_LIV_CAN | 1.1517 | 1.0213 | 1.1371 | AND |
| 4 | HIGH_PRO_BRE_CER_STO_ESO_CAN | -1.0431 | -1.2574 | 0.3520 | AGO |
##################################
# Merging the GeoDataFrame
# with world map using country codes
##################################
world_target = world.merge(cancer_death_rate_kmeans_target_map, left_on='gu_a3', right_on='CODE', how='left')
##################################
# Plotting the map by Smoking Prevalence
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='SMPREV', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "SMPREV"})
plt.title('SMPREV')
plt.show()
##################################
# Plotting the map by Overweight Prevalence
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='OWPREV', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "OWPREV"})
plt.title('OWPREV')
plt.show()
##################################
# Plotting the map by Alcohol Consumption
##################################
fig, ax = plt.subplots(1, 1, figsize=(12.5, 7))
world_target.boundary.plot(ax=ax, linewidth=1)
world_target.plot(column='ACSHAR', cmap="seismic", legend=True, ax=ax, legend_kwds={'label': "ACSHAR"})
plt.title('ACSHAR')
plt.show()
A k-means clustering model with 2 clusters enabled the efficient segmentation of countries into distinct groups and provided a more granular view of the relationships among cancer mortality rates (pancreatic, lung, colorectal, liver, prostate, breast, cervical, stomach and esophageal cancer), lifestyle factors (smoking prevalence, overweight prevalence and alcohol consumption) and geographical information (region), with cluster characteristics described as follows:
Overall, disparities in cancer mortality were observed among groups of countries with contrasting lifestyle choices and geographic locations. Shared risk factors, socio-economic demographics and genetic predisposition are potential drivers of the observed associations between death rates for certain cancer types. Unhealthy lifestyle choices, including smoking, obesity and excessive alcohol consumption, are well-established risk factors for cancer, acting through direct and indirect effects on cellular processes, inflammation, DNA damage and hormonal regulation. Additionally, a country's geographic location can have a substantial impact on cancer mortality rates through a complex interplay of factors, including differences in healthcare infrastructure, socio-economic conditions, cultural practices, environmental exposures and access to screening and early detection, among others.
From an initial dataset comprising 208 observations and 16 descriptors, an optimal subset of 183 observations and 16 descriptors representing cancer mortality, lifestyle factors and geolocation was determined after conducting a data quality assessment, excluding cases noted with irregularities and applying the preprocessing operations most suitable for the downstream analysis. All data quality issues were addressed without the need to eliminate any existing descriptors from the study.
Multiple clustering models with various cluster counts were formulated using the K-Means, Bisecting K-Means, Gaussian Mixture Model, Agglomerative and Ward Hierarchical methods. The best model with optimized hyperparameters from each algorithm was determined through internal resampling validation using 5-Fold Cross Validation, with the Silhouette Score as the primary performance metric. Given the unsupervised nature of the analysis, all candidate models were compared on internal validation and apparent performance.
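The resampling scheme above can be sketched as follows. This is a minimal illustration only, assuming scikit-learn and substituting a synthetic `make_blobs` matrix for the study's preprocessed 183-observation dataset; the fold-wise refitting and silhouette averaging mirror the described 5-Fold procedure, not the exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the scaled cancer death-rate matrix (assumption):
# 183 observations, 9 cancer-type descriptors
X, _ = make_blobs(n_samples=183, n_features=9, centers=2, random_state=88888888)

def cv_silhouette(X, n_clusters, n_splits=5, seed=88888888):
    """Mean Silhouette Score across K-Fold resamples for a given cluster count."""
    scores = []
    for fold_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        fold = X[fold_idx]
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(fold)
        scores.append(silhouette_score(fold, labels))
    return float(np.mean(scores))

# Evaluate candidate cluster counts and keep the best-scoring one
cv_scores = {k: cv_silhouette(X, k) for k in range(2, 7)}
best_k = max(cv_scores, key=cv_scores.get)
```

The same loop generalizes to the other candidate algorithms by swapping the estimator inside `cv_silhouette`.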
The final model selected among the candidates used K-Means Clustering with the following optimized hyperparameters: the number of clusters to form and centroids to generate (n_clusters=2), the number of times the k-means algorithm is run with different centroid seeds (n_init='auto', equivalent to 1 when init='k-means++'), and the initialization method (init='k-means++', which selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia). This model demonstrated the best cross-validated (Silhouette Score=0.23) and apparent (Silhouette Score=0.24) performance under an assumption of 2 optimal clusters, reflecting a moderate quality of the formulated clusters.
Post-hoc exploration of the model results involved clustering visualization using Pair Plots, Heat Maps and Geographic Maps, providing an intuitive way to investigate the characteristics of the two discovered cancer clusters (Cluster 0: HIGH_PAN_LUN_COL_LIV_CAN and Cluster 1: HIGH_PRO_BRE_CER_STO_ESO_CAN) across countries in terms of death rates, lifestyle factors and geolocation. These findings aided in formulating insights on the relationships and associations among the various descriptors for the identified clusters.
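The numeric summary underlying the heat map views is the per-cluster mean of each descriptor. A minimal sketch of that profiling step, again assuming scikit-learn and a synthetic stand-in matrix with three illustrative lifestyle columns in place of the study's full descriptor set:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed descriptor matrix (assumption);
# column names mirror the lifestyle descriptors shown in the maps above
X, _ = make_blobs(n_samples=183, n_features=3, centers=2, random_state=88888888)
frame = pd.DataFrame(X, columns=['SMPREV', 'OWPREV', 'ACSHAR'])

# Assign observations using the selected model configuration
frame['KMEANS_CLUSTER'] = KMeans(n_clusters=2, init='k-means++', n_init=10,
                                 random_state=88888888).fit_predict(X)

# Per-cluster descriptor means: the table a heat map would render
cluster_profile = frame.groupby('KMEANS_CLUSTER').mean()
```

Passing `cluster_profile` to, e.g., `seaborn.heatmap` reproduces the heat-map style comparison of cluster characteristics.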
from IPython.display import display, HTML
display(HTML("<style>.rendered_html { font-size: 15px; font-family: 'Trebuchet MS'; }</style>"))